Abstract

Biological and machine pattern recognition systems face a common challenge: Given sensory data 
about an unknown object, classify the object by comparing the sensory data with a library of internal 
representations stored in memory. In many cases of interest, the number of patterns to be discriminated 
and the richness of the raw data force recognition systems to internally represent memory and sensory 
information in a compressed format. However, these representations must preserve enough information to 
accommodate the variability and complexity of the environment, or else recognition will be unreliable. 
Thus, there is an intrinsic tradeoff between the amount of resources devoted to data representation and the 
complexity of the environment in which a recognition system may reliably operate. 

In this paper we describe a general mathematical model for pattern recognition systems subject to 
resource constraints, and show how the aforementioned resource-complexity tradeoff can be characterized 
in terms of three rates related to the number of bits available for representing memory and sensory data, and the
number of patterns populating a given statistical environment. We prove single-letter information theoretic 
bounds governing the achievable rates, and illustrate the theory by analyzing the elementary cases where 
the pattern data is either binary or Gaussian. 

I. Introduction 

PATTERN recognition is the problem of inferring the nature of unknown objects from incoming and 
previously stored data. In real-world operating environments, the volume of raw data available often 
exceeds a recognition system's resources for data storage and representation. Consequently, data stored 
in memory only partially summarizes the properties of physical objects, and internal representations of 
incoming sensory data are likewise imperfect approximations. In other words, pattern recognition with 
physical systems is frequently a problem of inference from compressed data. However, excessive data 
compression precludes reliable pattern recognition. In this paper we attempt to answer the following 
question: In a given environment, what are the least amounts of memory data and sensory data consistent 
with reliable pattern recognition? 

The paper is organized as follows. In section II we introduce the general problem qualitatively. Relationships
between the present work and other pattern recognition research are briefly described in section III. In section
V we formalize our problem as that of determining which combinations of three key rates are achievable,
that is, which rate combinations are consistent with the possibility of reliable pattern recognition. These
rates are directly related to the number of bits available for representing memory and sensory data, and the
number of distinct patterns which the recognition system must be able to discriminate. The main results of
the paper are single letter formulas providing inner and outer bounds on the set of achievable rates, given
in section VI and discussed in section VII. The theory is illustrated by applying it to the binary case in
section VIII and the Gaussian case in section IX.

II. Informal problem description 

In general, statistical pattern recognition problems may be specified in terms of a probabilistic model of
the environment ('nature') 1 , a pattern recognition system, and the interactions of the system with the
environment during two distinct modes of operation: a training ('offline') phase and a testing ('online')
phase. Informal descriptions of the environment and system models we study are given below, and
formalized in section V. Our model and viewpoint are similar to others in the statistical pattern recognition
literature (see, e.g. [6], [8], [10], [14], [21]), but fit most closely within the framework of Pattern Theory
(see e.g. [12], [19], [20], [23], [24]). Please refer to the block diagram in figure 1 while reading the
following description.

This work was supported by the Mathers Foundation and by the Office of Naval Research. 

1 Non-probabilistic models have also been considered. Arguments for preferring the probabilistic formulation are discussed in [25].
Fig. 1. Block diagram for a generic pattern recognition system.



A. Environment 



Training patterns and the training phase. The environment for a pattern recognition system is defined as
the set of distinct entities that the system must learn to reliably distinguish. These entities are hereafter
referred to simply as patterns, and may include, for example, distinct physical objects, properties of
objects, or arrangements of multiple objects. We assume each pattern can be represented by an $n$-vector
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$ whose elements take values in some alphabet $\mathcal{X}$. Of the $|\mathcal{X}|^n$ possible patterns,
the environment contains only a small subset $\{\mathbf{X}(1), \mathbf{X}(2), \ldots, \mathbf{X}(M_c)\}$, $M_c \ll |\mathcal{X}|^n$. Before
entering the environment, the system does not know which specific patterns will be present; it
knows only their number $M_c$ and that they are generated according to some probability distribution $p(\mathbf{x})$.

After being introduced into the environment, the system initially enters the training phase. During training
the system attempts to form and store an internal representation (memory) of each pattern along with
a semantic label, $w \in \mathcal{M}_c = \{1, 2, \ldots, M_c\}$. In concrete terms, the labels might correspond to a set
of actions the system should undertake when it encounters each pattern, 'pointers' to additional stored
information, or 'names' for the patterns. For simplicity, we take the labels to be integers, and denote the
training set by $\mathcal{C}_x = \{(\mathbf{X}(1), 1), (\mathbf{X}(2), 2), \ldots, (\mathbf{X}(M_c), M_c)\}$.

Observations and the testing phase. After the training phase, the system enters an 'online' testing phase.
During testing the observed data is generated as follows. Nature randomly selects a pattern index $W$ according
to some distribution $p(w)$, $w \in \mathcal{M}_c$, retrieves the corresponding pattern $\mathbf{x}(W) \in \mathcal{C}_x$, and subjects it to a
random transformation $p(\mathbf{y}|\mathbf{x})$ to produce a signal $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ with elements in some alphabet
$\mathcal{Y}$. The patterns $\mathbf{x}$ in $\mathcal{C}_x$ thus represent 'pure signals' or prototypes, and the observations $\mathbf{y} \in \mathcal{Y}^n$ represent
distorted and noise-corrupted variations or signatures of the underlying patterns. The random map $p(\mathbf{y}|\mathbf{x})$
models two major intrinsic sources of difficulty in real-world pattern recognition problems: signature
variation, differences between the sensory signals generated on different occasions by the same underlying
object; and signature ambiguities, the fact that distinct objects often produce similar or identical signatures 2 .

2 Grenander [12] and Mumford [24] have argued that four 'universal transformations' (noise and blur, superposition, domain 
warping, and interruptions) account for most of the ambiguity and variability in naturally occurring signals. 
B. Recognition system

A recognition system consists of three components (functions): a memory encoder $f$; a sensory encoder
$\phi$; and a classifier $g$. Since we assume the system must be designed prior to insertion into its environment,
the functions $(f, \phi, g)$ must be defined independently of the specific realizations of the training data $\mathcal{C}_x$ and
sensory data encountered during online operation. On the other hand, the system design can take account
of statistical information about the environment, i.e. knowledge of the distributions $p(x)$ and $p(y|x)$.

Encoders. The memory and sensory encoders $f$ and $\phi$ are mappings from the domains of the raw training
and sensory data, respectively, into some form of approximate internal representations. Encoding may 
comprise several distinct operations, such as smoothing and noise reduction, segmentation, normalization, 
dimensionality reduction, etc., often collectively referred to as 'feature extraction' procedures [14]. In 
principle, the role of the resulting internal data representations may be played by any distinct set of 
physical configurations or 'states' of the system, provided that mechanisms exist for associating the training 
data with these memory states; inducing appropriate internal states from the sensory data; and retrieving 
memorized data, comparing it with compressed sensory data, and reporting a recognition decision. 

Conceptually, we can alternatively regard the internal states of the system as 'codewords,' denoted
$\mathcal{C}_u = \{\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(M_x)\}$ for the memory encoder and $\mathcal{C}_v = \{\mathbf{v}(1), \mathbf{v}(2), \ldots, \mathbf{v}(M_y)\}$ for the sensory
encoder, where the codeword alphabets $\mathcal{U}$ and $\mathcal{V}$ are dictated by the physical nature of the system's
memory and sensory systems.

The sensory encoder is then defined as a mapping from the entire observation space onto the indices
$\mathcal{M}_y = \{1, 2, \ldots, M_y\}$ of the sensory codebook, $\phi : \mathcal{Y}^n \to \mathcal{M}_y$, $\phi(\mathbf{y}) = \mu$, or equivalently, onto the
codewords $\mathcal{C}_v$. The memory encoder $f$ is similar, except that it receives labeled inputs and produces
labeled outputs: given a labeled training pattern $(\mathbf{x}(W), W)$, $f$ associates to it a memory index
$m \in \mathcal{M}_x = \{1, 2, \ldots, M_x\}$ and reproduces the class label $w \in \mathcal{M}_c$, representing its storage in memory.
Thus, $f$ is a mapping from the product of the entire training data space and the set of training labels onto
the product of the memory indices and class labels, $f : \mathcal{X}^n \times \mathcal{M}_c \to \mathcal{M}_x \times \mathcal{M}_c$, $f(\mathbf{x}, w) = (m, w)$.

Classifier. The classifier, $g$, attempts to infer the class label of an encountered pattern on the basis of
the compressed sensory information and data stored in memory. Abstractly, the inference process may
be thought of as a search through the codebook $\mathcal{C}_u$ for the memory codeword best matching the
current sensory codeword $\mathbf{v} \in \mathcal{C}_v$. Physical implementations of the matching process may take the form
of computational algorithms; the dynamics of some physical medium (e.g. a biological neural network);
or an abstract decision rule. Mathematically, a classifier is a mapping $g$ from the encoded sensory data
$\mu = \phi(\mathbf{y}) \in \mathcal{M}_y$ and the memory data $\mathcal{C}_u$ to a class label $\hat{w} \in \mathcal{M}_c$, i.e. $g : \mathcal{M}_y \times \mathcal{C}_u \to \mathcal{M}_c$, $g(\mu, \mathcal{C}_u) = \hat{w}$.

C. Figures of merit 

For given distributions $p(x)$ and $p(y|x)$ and data dimension $n$, there is clearly an intrinsic tradeoff between
the number of internal memory and sensory states, $M_x$ and $M_y$, and the number of patterns $M_c$ that can be
reliably recognized. For our purposes it is preferable to characterize this tradeoff in a dimensionless manner,
that is, in terms of rates. The rates of the memory and sensory encoders $f$ and $\phi$ are given respectively by
$R_x = \log_2 M_x / n$, $R_y = \log_2 M_y / n$, where standard interpretations apply (see, e.g. [?], [?], [?]): viewing
the indices of the memory codebook $\mathcal{M}_x = \{1, 2, \ldots, M_x\}$ as binary strings of length $N_x = \log_2 M_x$,
the rate $R_x$ is simply the cost, in bits/symbol, of representing each length-$n$ training pattern $\mathbf{x} \in \mathcal{X}^n$ by
a length-$N_x$ binary string, $R_x = N_x / n$. The analogous interpretation applies to the sensory codebook. We
also quantify the amount of data in the training set by defining a rate $R_c = \log_2 M_c / n$, interpreted as the
number of training patterns discriminated per symbol of encoded memory and sensory data.
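
As a small illustration of these definitions, the following sketch converts codebook sizes into the three rates. The function name `rates` and the numerical values are hypothetical choices made for this example only; they do not come from the paper.

```python
import math

def rates(M_c, M_x, M_y, n):
    """Return (R_c, R_x, R_y) in bits per symbol for the given set sizes and block length n."""
    return tuple(math.log2(M) / n for M in (M_c, M_x, M_y))

# Hypothetical values: block length n = 100, 2^20 patterns,
# 2^50 memory states and 2^40 sensory states.
R_c, R_x, R_y = rates(2**20, 2**50, 2**40, 100)
print(R_c, R_x, R_y)
```

Here each memory index costs $N_x = 50$ bits, so $R_x = N_x / n = 0.5$ bits per symbol, matching the interpretation given above.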

D. The meaning of large n 

Some of the results below (specifically, the 'achievability' proofs) rely on asymptotic arguments, requiring 
the parameter $n$ to grow large. Physically, 'large-n' may correspond to representing the sensory and memory
data at high resolution; collecting more of it; or making repeated measurements [28]. On the other hand, 
though our proofs employ asymptotic arguments, the theorems themselves are stated in terms of single letter
formulas, and in this sense they are independent of n. Hence, the 'large-n' assumption in the achievability 
proofs is not necessarily a fundamental limitation of the theory. 

III. Related issues 

Before formalizing our problem, we briefly comment on some relationships between the present work and 
other issues in pattern recognition. 

Probabilistic modeling. Our analysis supposes the existence of probabilistic models for the recognition 
environment, and that these distributions are available for use in designing the recognition system. For 
some types of random patterns, such as the pattern of grains on a wooden surface or of magnetic particles 
on magnetic tape, estimating the probability distributions is relatively straightforward [28]. Substantial 
progress has also been made in modeling more challenging objects, such as textures in natural imagery 
[7], [13], [26], [29], [31], and speech signals [15]. Nevertheless, in many cases of interest the development 
of accurate probabilistic models remains a challenge, and is an active focus of pattern recognition
research.

Data compression. The importance of data compression in pattern recognition systems appears most clearly 
articulated in the neuroscience literature, due largely to the pioneering work of Horace Barlow. Barlow 
has written extensively about experimental evidence and theoretical reasons for believing that principles 
of efficient data compression underlie the capacity of animal brains for learning and intelligent behavior
(see, e.g. [1]-[4], [11]). Additionally, in the past few decades much additional work in neurobiology has
provided experimental evidence for efficient coding mechanisms in the sensory systems of diverse animals, 
including monkeys, cats, frogs, crickets, and flies [27]. More recently, data compression has come to be 
viewed as essential for managing metabolic energy costs in animal brains [22]. 

In the engineering pattern recognition literature, data compression usually arises in the context of feature 
extraction. Feature extractors are typically designed with the objectives of transforming the raw data 
available to the system into a format which facilitates easy matching or storage, and is robust ("invariant") 
with respect to characteristic signature variations in sensory data [14], [21]. With respect to these goals, 
the volume of data used for internal data representations is present as an implicit constraint, since efficient 
data manipulation is often best achieved by compact representations. For complex environments, the cost 
of data representation becomes critical as an explicit design constraint. Whatever the motivations are, the
crucial common aspect of all data encoding operations for our present purposes is that they reduce the 
amount of data available to the system as compared with the original data (usually in a lossy manner). 

Performance prediction vs. normalization. Performance prediction is the problem of characterizing the 
performance for specific classes of recognition systems, often with the goal of discovering the optimal 
member (e.g. best parameter settings) of a given class [8]. By contrast, our objective is to characterize the 
requirements for the existence of reliable pattern recognition systems, and to describe absolute performance 
limits governing all such systems. In this sense, we aim to provide normalized performance bounds, with 
respect to which the performance of any actual or proposed recognition system may be evaluated. 

IV. Notation 

We adopt the following notational conventions. Random variables are denoted by capital letters (e.g. $U$),
and their values by lowercase letters (e.g. $u$). The alphabet in which a random variable takes values is
denoted by a script capital letter (e.g. $\mathcal{U}$). Sequences of symbols are denoted either by boldface letters
or with a superscript, interchangeably (e.g. $\mathbf{u} = u^n = (u_1, u_2, \ldots, u_n)$ denotes a vector which takes
values in the product alphabet $\mathcal{U}^n$). The probability mass function (p.m.f.) for a random variable $U \in \mathcal{U}$
is denoted by $p_U(u)$, $u \in \mathcal{U}$. When the appropriate subscript is clear from context, we omit it to simplify
notation; e.g. we usually write $p_U(u)$ simply as $p(u)$. Given random variables $U, V, W$, we denote the
entropy of $U$ by $H(U)$, the mutual information between $U$ and $V$ by $I(U;V)$, and the conditional mutual
information between $U$ and $V$ given $W$ by $I(U;V|W)$. The standard acronym 'i.i.d.' will stand for the
phrase 'independent and identically distributed.' To express statements like '$U$ and $V$ are strongly jointly
delta typical' we write $(U, V) \in T_{UV}$. The definition of strong (delta) joint typicality will be reviewed in the
section where it first appears. Finally, to express statements like '$X$ and $Z$ are conditionally independent
given $Y$,' i.e. $p(x,y,z) = p(y)\,p(x|y)\,p(z|y)$, we write '$X - Y - Z$ form a Markov chain,' or simply
$X - Y - Z$.

V. Formal problem statement 
Definition 5.1: The environment for a pattern recognition system, denoted by

$$\mathcal{E} = (\mathcal{M}_c, p(w), \mathcal{X}, p(x), p(y|x), \mathcal{Y}),$$

consists of three finite alphabets $\mathcal{M}_c$, $\mathcal{X}$, $\mathcal{Y}$, probability distributions $p(w)$ and $p(x)$ over $\mathcal{M}_c$ and $\mathcal{X}$, and
a collection of probability distributions $p(y|x)$ on $\mathcal{Y}$, one for each $x \in \mathcal{X}$.

The interpretations are those given in the preceding section: $\mathcal{M}_c = \{1, 2, \ldots, M_c\}$ is the set of class labels;
pattern vectors are written in the symbols of $\mathcal{X}$; and sensory data vectors in the symbols of $\mathcal{Y}$. For our
analysis we assume:

• the distribution over class labels is uniform, $p(w) = 1/|\mathcal{M}_c|$ for all $w \in \mathcal{M}_c$;

• the pattern components are i.i.d., $p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i)$;

• the observation channel is memoryless, $p(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{n} p(y_i|x_i)$.
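
The three assumptions above fully determine how training and test data are generated, which can be sketched as a short simulation. This is a minimal sketch only: the function `sample_environment`, the binary alphabet, and the crossover value 0.1 are hypothetical choices for illustration, not part of the paper's model.

```python
import random

def sample_environment(M_c, n, p_x, channel, rng):
    """Draw M_c i.i.d. training patterns, a uniform test label, and one noisy observation."""
    symbols = list(p_x)
    weights = [p_x[s] for s in symbols]
    # Training set: every component of every pattern is drawn i.i.d. from p(x).
    patterns = [rng.choices(symbols, weights=weights, k=n) for _ in range(M_c)]
    # Testing phase: the label w is uniform on {1, ..., M_c}.
    w = rng.randrange(1, M_c + 1)
    x = patterns[w - 1]
    # Memoryless channel: each y_i depends only on x_i through p(y | x_i).
    y = []
    for x_i in x:
        outputs = list(channel[x_i])
        probs = [channel[x_i][o] for o in outputs]
        y.append(rng.choices(outputs, weights=probs)[0])
    return patterns, w, y

# Hypothetical binary example: uniform p(x), binary symmetric channel with crossover 0.1.
p_x = {0: 0.5, 1: 0.5}
channel = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
patterns, w, y = sample_environment(M_c=8, n=16, p_x=p_x, channel=channel, rng=random.Random(0))
```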

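Definition 5.2: An $(M_c, M_x, M_y, n)$ pattern recognition code for an environment $\mathcal{E}$ consists of three sets
of integers

$$\mathcal{M}_c = \{1, 2, \ldots, M_c\}, \quad \mathcal{M}_x = \{1, 2, \ldots, M_x\}, \quad \mathcal{M}_y = \{1, 2, \ldots, M_y\};$$

a set of length-$n$ sequences $\mathbf{X}(i) \in \mathcal{X}^n$, $i = 1, 2, \ldots, M_c$, where all components are drawn independently
from $p(x)$ and each sequence is paired with a distinct index from $\mathcal{M}_c$,

$$\mathcal{C}_x = \{(\mathbf{X}(1), 1), (\mathbf{X}(2), 2), \ldots, (\mathbf{X}(M_c), M_c)\};$$

a memory encoder

$$f : \mathcal{X}^n \times \mathcal{M}_c \to \mathcal{M}_x \times \mathcal{M}_c; \quad f(\mathbf{x}, w) = (m, w);$$

a sensory data encoder

$$\phi : \mathcal{Y}^n \to \mathcal{M}_y; \quad \phi(\mathbf{y}) = \mu;$$

and a classifier

$$g : \mathcal{M}_y \times \mathcal{C}_u \to \mathcal{M}_c, \quad g(\mu, \mathcal{C}_u) = \hat{w}$$

composed of two submappings $g = g_2 \circ g_1$,

$$g_1 : \mathcal{M}_y \to \mathcal{M}_x; \quad g_1(\mu) = \hat{m}$$

$$g_2 : \mathcal{M}_x \times \mathcal{C}_u \to \mathcal{M}_c; \quad g_2(\hat{m}, \mathcal{C}_u) = \hat{w},$$

where $\mathcal{C}_u$ denotes the encoded training data

$$\mathcal{C}_u = f(\mathcal{C}_x) = \{(m(1), 1), \ldots, (m(M_c), M_c)\}.$$

For convenience hereafter, we refer to an $(M_c, M_x, M_y, n)$ pattern recognition code by its three constituent
mappings $(f, \phi, g)_n$, or simply as $(f, \phi, g)$ when the integer $n$ is clear from context.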

The rate $R = (R_c, R_x, R_y)$ of an $(M_c, M_x, M_y, n)$ code is

$$R_c = \frac{1}{n} \log_2 M_c, \quad R_x = \frac{1}{n} \log_2 M_x, \quad R_y = \frac{1}{n} \log_2 M_y,$$

where the units are bits per symbol.

For each pattern-label pair $(\mathbf{x}(w), w) \in \mathcal{C}_x$, let $m(w)$ be the memory index assigned to $\mathbf{x}(w)$ by the
memory encoder $f$, and let the corresponding sensory data be $\mathbf{y}$. Define two error events

$$\mathcal{E}_1(w) = \{\hat{m} \neq m(w)\}$$
$$\mathcal{E}_2(w) = \{\hat{w} \neq w\},$$

where $\hat{m} = g_1(\mu) = g_1(\phi(\mathbf{y}))$ and $\hat{w} = g(\mu, \mathcal{C}_u) = g_2(\hat{m}, \mathcal{C}_u) = g_2(g_1(\phi(\mathbf{y})), \mathcal{C}_u)$; and denote the union
by

$$\mathcal{E}(w) = \mathcal{E}_1(w) \cup \mathcal{E}_2(w).$$

During the testing phase of operation, if the pattern index $w \in \mathcal{M}_c$ is selected, let

$$P_e^n(w) = \Pr\{\mathcal{E}(w)\}$$

denote the probability of error. Note that these probabilities depend only on the random vectors $\mathbf{X}(w)$
and $\mathbf{Y}$ and hence are determined by the joint distribution $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})\,p(\mathbf{y}|\mathbf{x})$. We define the average
probability of error of the code as

$$P_e^n = \frac{1}{M_c} \sum_{w=1}^{M_c} P_e^n(w).$$

Note that this probability is calculated under a uniform distribution on the pattern indices, $p(w) = 1/M_c$.
That is, we assume that every pattern index $w \in \mathcal{M}_c$, and hence every pattern $\mathbf{X}(w)$, is selected with
equal probability during the testing phase.

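Comment 5.3: Expanding the probability of error in two ways,

$$P_e^n = \Pr\{\mathcal{E}_1 \cup \mathcal{E}_2\}$$
$$= \Pr\{\mathcal{E}_1\} + \Pr\{\mathcal{E}_1^c\} \Pr\{\mathcal{E}_2 | \mathcal{E}_1^c\}$$
$$= \Pr\{\mathcal{E}_2\} + \Pr\{\mathcal{E}_2^c\} \Pr\{\mathcal{E}_1 | \mathcal{E}_2^c\},$$

we see that $P_e^n = 0$ if and only if

$$\Pr\{\mathcal{E}_1\} = \Pr\{\mathcal{E}_2\} = \Pr\{\mathcal{E}_1 | \mathcal{E}_2^c\} = \Pr\{\mathcal{E}_2 | \mathcal{E}_1^c\} = 0.$$

The interpretation is that in a reliable pattern recognition system both components $g_1$ and $g_2$ of the classifier
$g$ must function reliably.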

Definition 5.4: A rate $R = (R_c, R_x, R_y)$ is achievable in a recognition environment $\mathcal{E}$ if for any $\epsilon > 0$
and for all $n$ sufficiently large, there exists an $(M_c, M_x, M_y, n)$ code $(f, \phi, g)_n$ with

$$M_c \geq 2^{n R_c}$$
$$M_x \leq 2^{n R_x}$$
$$M_y \leq 2^{n R_y}$$

such that $P_e^n \leq \epsilon$.



Definition 5.5: The achievable rate region $\mathcal{R}$ for a recognition environment $\mathcal{E}$ is the set of all achievable
rate triples.

The primary goal of this paper is to characterize the achievable rate region $\mathcal{R}$ in a way that does not
involve the unbounded parameter $n$, that is, to exhibit a single letter characterization of $\mathcal{R}$.
VI. Main results

In this section we present inner and outer bounds on the achievable rate region $\mathcal{R}$. The bounds are expressed
in terms of sets of 'auxiliary' random variable pairs $UV$, defined below. In these definitions we assume
that $U$ and $V$ take values in finite alphabets $\mathcal{U}$ and $\mathcal{V}$ and have a well defined joint distribution with the
'given' random variables $XY$. To each such pair of auxiliary random variables $UV$ we associate a set of
rates $\mathcal{R}_{UV}$ defined by

$$\mathcal{R}_{UV} = \{R : R_x \geq I(U;X),$$
$$R_y \geq I(V;Y),$$
$$R_c \leq I(U;V) - I(U;V|X,Y)\}.$$

Next, we define two sets of random variable pairs,

$$\mathcal{P}_{in} = \{UV : U - X - Y,$$
$$X - Y - V,$$
$$U - (X,Y) - V\}$$

and

$$\mathcal{P}_{out} = \{UV : U - X - Y,$$
$$X - Y - V\}.$$

When convenient hereafter, we express the three independence constraints in $\mathcal{P}_{in}$ as a single 'long' Markov
chain, $U - X - Y - V$.

Finally, we define two additional sets of rates

$$\mathcal{R}_{in} = \{R : R \in \mathcal{R}_{UV} \text{ for some } UV \in \mathcal{P}_{in}\}$$
$$\mathcal{R}_{out} = \{R : R \in \mathcal{R}_{UV} \text{ for some } UV \in \mathcal{P}_{out}\}.$$

Comment 6.1: Note that for rates in $\mathcal{R}_{in}$, the long Markov constraint $U - X - Y - V$ implies that the
second term in the third inequality of $\mathcal{R}_{UV}$ vanishes, i.e. $I(U;V|XY) = 0$.
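
To make the definition of the rate set concrete, the following sketch evaluates its three limits numerically for one hypothetical auxiliary pair satisfying the long chain: $X$ uniform binary, and $U$, $Y$, $V$ obtained through binary symmetric channels. The crossover values and all function names are assumptions made for this illustration; they are not the paper's binary example.

```python
import math
from itertools import product

def marginal(joint, idx):
    """Marginalize a joint pmf (dict: outcome tuple -> prob) onto the coordinates in idx."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def entropy(joint, idx):
    """Entropy, in bits, of the coordinates in idx."""
    return -sum(p * math.log2(p) for p in marginal(joint, idx).values() if p > 0)

def mutual_info(joint, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); a, b, c are coordinate tuples."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))

def flip(s, t, eps):
    """Transition probability of a binary symmetric channel with crossover eps."""
    return eps if s != t else 1.0 - eps

def long_chain_joint(a, q, b):
    """Joint pmf of (U, X, Y, V): X uniform, U - X - Y - V built from three BSCs."""
    return {(u, x, y, v): 0.5 * flip(u, x, a) * flip(y, x, q) * flip(v, y, b)
            for u, x, y, v in product((0, 1), repeat=4)}

U, X, Y, V = (0,), (1,), (2,), (3,)
joint = long_chain_joint(a=0.1, q=0.2, b=0.1)   # illustrative crossover values

Rx_min = mutual_info(joint, U, X)               # lower limit on R_x
Ry_min = mutual_info(joint, V, Y)               # lower limit on R_y
penalty = mutual_info(joint, U, V, X + Y)       # I(U;V|X,Y); vanishes on the long chain
Rc_max = mutual_info(joint, U, V) - penalty     # upper limit on R_c
print(Rx_min, Ry_min, Rc_max, penalty)
```

Up to floating-point error the conditional term comes out as zero, in agreement with Comment 6.1.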

Our main results are the following. 

Theorem 6.2 (Positive theorem: Inner bound):

$$\mathcal{R}_{in} \subseteq \mathcal{R}.$$

That is, every rate $R \in \mathcal{R}_{in}$ is achievable.

Theorem 6.3 (Negative theorem: Outer bound):

$$\mathcal{R}_{out} \supseteq \mathcal{R}.$$

That is, no rate $R \notin \mathcal{R}_{out}$ is achievable.

The proofs appear in Appendices I and II.

Remark 6.4: If either $X = U$ or $Y = V$, or both, then the inner and outer bounds are identical, since in
this case the extra Markov condition $U - (X,Y) - V$ in the definition of $\mathcal{P}_{in}$ is automatically satisfied.
For example, if $U = X$, then the condition is equivalent to $I(U;V|XY) = I(X;V|XY) = 0$, which is
obviously true. Similar comments apply if $U$ and $V$ are any deterministic functions of $X$ and $Y$.

VII. Discussion of the main results 

A. The gap between bounds 

The true achievable rate region is sandwiched between the sets $\mathcal{R}_{in}$ and $\mathcal{R}_{out}$, i.e. $\mathcal{R}_{in} \subseteq \mathcal{R} \subseteq \mathcal{R}_{out}$.
The gap between $\mathcal{R}_{in}$ and $\mathcal{R}_{out}$ is due to the different independence constraints in the definitions of $\mathcal{P}_{in}$
and $\mathcal{P}_{out}$. Whereas distributions in $\mathcal{P}_{in}$ satisfy three Markov-chain constraints $U - X - Y$, $X - Y - V$,
and $U - (X,Y) - V$ or, equivalently, the single 'long chain' constraint $U - X - Y - V$, distributions in
$\mathcal{P}_{out}$ need only satisfy the first two 'short chain' constraints. Hence, $\mathcal{R}_{out}$ is the larger rate region and, in
general, we expect a gap between the two regions.

B. Convexity 

One manifestation of the difference between $\mathcal{R}_{out}$ and $\mathcal{R}_{in}$ is that $\mathcal{R}_{out}$ is convex, while $\mathcal{R}_{in}$ generally
is not. We state this here as a lemma:

Lemma 7.1: $\mathcal{R}_{out}$ is a convex set, in the sense that all rates along the line connecting any two rates $R_1$ and
$R_2$ contained in $\mathcal{R}_{out}$ are also contained in $\mathcal{R}_{out}$.

The convexity of $\mathcal{R}_{out}$ is proved in Appendix III. The nonconvexity of $\mathcal{R}_{in}$ is apparent from the examples
studied in sections VIII and IX.

C. Berger's observation and implications 

At least in part, the reason for the gap can be appreciated more concretely using the following observation
made by Berger when discussing the distributed source coding problem, for which the currently known
inner and outer bounds on the achievable rates are separated by a similar gap [?]. Observe that the
long-chain Markov constraint on $\mathcal{P}_{in}$ implies that each corresponding joint distribution over $UV$ given
$XY$ must factorize into a product of marginal distributions, $p(u,v|x,y) = p(u|x)\,p(v|y)$. By contrast, the
less restrictive constraints on $\mathcal{P}_{out}$ admit pairs whose joint distributions are convex mixtures of product
marginals; that is, distributions of the form

$$p(u,v|x,y) = \sum_q p(q)\,p(u|x,q)\,p(v|y,q).$$

More explicitly, we can represent the set of all such auxiliary random variable pairs as follows. 
Definition 7.1: Let

$$\mathcal{P}_{mix} = \{UV : U = (U_Q, Q),\ V = (V_Q, Q)\},$$

where $Q$ is any discrete random variable with a finite alphabet $\mathcal{Q}$ which is independent of $X$ and $Y$, and
for each $q \in \mathcal{Q}$ the pair $U_q V_q \in \mathcal{P}_{in}$.

Clearly, there is potentially a much larger set of distributions for $UV$ pairs in $\mathcal{P}_{mix}$ than in $\mathcal{P}_{in}$.
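
The difference between a product conditional and a mixture of products can be checked numerically. The sketch below builds a hypothetical two-component mixture of the form $\sum_q p(q)\,p(u|x,q)\,p(v|y,q)$ (with the mixing variable summed out rather than appended to $U$ and $V$) and confirms that both short chains hold while the long-chain condition fails. All names and parameter values are assumptions chosen for illustration.

```python
import math
from itertools import product

def marginal(joint, idx):
    """Marginalize a joint pmf (dict: outcome tuple -> prob) onto the coordinates in idx."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def entropy(joint, idx):
    """Entropy, in bits, of the coordinates in idx."""
    return -sum(p * math.log2(p) for p in marginal(joint, idx).values() if p > 0)

def mutual_info(joint, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); a, b, c are coordinate tuples."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))

def flip(s, t, eps):
    """Transition probability of a binary symmetric channel with crossover eps."""
    return eps if s != t else 1.0 - eps

def mixture_joint(q_xy, p_q, eps_u, eps_v):
    """Joint pmf of (U, X, Y, V) with X uniform, Y a BSC(q_xy) output of X, and
    p(u,v|x,y) = sum_q p(q) p(u|x,q) p(v|y,q), each component being a pair of BSCs."""
    joint = {}
    for u, x, y, v in product((0, 1), repeat=4):
        cond = sum(pq * flip(u, x, eu) * flip(v, y, ev)
                   for pq, eu, ev in zip(p_q, eps_u, eps_v))
        joint[(u, x, y, v)] = 0.5 * flip(y, x, q_xy) * cond
    return joint

U, X, Y, V = (0,), (1,), (2,), (3,)
joint = mixture_joint(q_xy=0.2, p_q=(0.5, 0.5), eps_u=(0.1, 0.4), eps_v=(0.1, 0.4))

short_1 = mutual_info(joint, U, Y, X)        # I(U;Y|X): zero iff U - X - Y holds
short_2 = mutual_info(joint, X, V, Y)        # I(X;V|Y): zero iff X - Y - V holds
long_gap = mutual_info(joint, U, V, X + Y)   # I(U;V|X,Y): zero iff U - (X,Y) - V holds
print(short_1, short_2, long_gap)
```

In this example the first two quantities vanish up to rounding while the third is strictly positive, so the pair satisfies the short-chain constraints but not the long-chain constraint.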

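However, while $\mathcal{P}_{mix}$ is clearly contained in $\mathcal{P}_{out}$, it is unknown whether or under what conditions
$\mathcal{P}_{mix} = \mathcal{P}_{out}$. Further, if we define the additional rate region

$$\mathcal{R}_{mix} = \{R : R \in \mathcal{R}_{UV} \text{ for some } UV \in \mathcal{P}_{mix}\},$$

and let $\mathrm{Co}(\mathcal{R}_{in})$ denote the convex hull of $\mathcal{R}_{in}$,

$$\mathrm{Co}(\mathcal{R}_{in}) = \{R : R = \theta R_1 + \bar{\theta} R_2,\ R_1, R_2 \in \mathcal{R}_{in},\ 0 \leq \theta \leq 1\},$$

where $\bar{\theta} = 1 - \theta$, then it is easy to verify that the following logical statement holds:

If $\mathcal{P}_{out} = \mathcal{P}_{mix}$, then $\mathcal{R}_{out} = \mathcal{R}_{mix} = \mathrm{Co}(\mathcal{R}_{in})$. (1)

Thus, it is unknown whether the presence of mixture distributions in $\mathcal{P}_{out}$ is enough to account for all of
the gap between $\mathcal{R}_{in}$ and $\mathcal{R}_{out}$. As discussed below in subsection VII-F, (1) has interesting implications
for closing the gap.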

D. Relationship with distributed source coding 

Some interesting connections hold between the results of Tung and Berger [5], [30] for the distributed
source coding (DSC) problem and our results in theorems 6.2 and 6.3. Briefly, the situation treated in the
DSC problem, diagrammed in figure 2, is as follows. Two correlated sequences, $\mathbf{X}$ and $\mathbf{Y}$, are encoded
separately as $m = f(\mathbf{X})$, $\mu = \phi(\mathbf{Y})$, and the decoder $g$ must reproduce the original sequences subject to a
fidelity constraint, $(E d_x(\mathbf{X}, \hat{\mathbf{X}}), E d_y(\mathbf{Y}, \hat{\mathbf{Y}})) \leq D$, where $D = (D_x, D_y)$. The problem is to characterize,
for any given distortion $D$, the set of achievable rates $\mathcal{R}(D)$.








Fig. 2. The distributed source coding problem. 



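The known inner and outer bounds for the DSC problem are as follows. Let $\mathcal{P}_{in}$ and $\mathcal{P}_{out}$ be defined as
above, and define two new sets incorporating the distortion constraint

$$\mathcal{P}_{in}(D) = \mathcal{P}_{in} \cap \mathcal{P}_{UV}(D)$$
$$\mathcal{P}_{out}(D) = \mathcal{P}_{out} \cap \mathcal{P}_{UV}(D),$$

where

$$\mathcal{P}_{UV}(D) = \{UV : \exists\, \hat{X}(U,V), \hat{Y}(U,V) \text{ s.t. } (E d_x(X, \hat{X}), E d_y(Y, \hat{Y})) \leq D\}.$$

Paralleling the definition of $\mathcal{R}_{UV}$ in section VI, also define the sets of rates

$$\mathcal{R}_{UV} = \{R : R_x \geq I(U;X|V),$$
$$R_y \geq I(V;Y|U),$$
$$R_x + R_y \geq I(UV;XY)\}$$

and

$$\mathcal{R}_{in}(D) = \{R : R \in \mathcal{R}_{UV} \text{ for some } UV \in \mathcal{P}_{in}(D)\}$$
$$\mathcal{R}_{out}(D) = \{R : R \in \mathcal{R}_{UV} \text{ for some } UV \in \mathcal{P}_{out}(D)\}.$$

Then the Berger-Tung bounds for the DSC problem can be expressed as $\mathcal{R}_{in}(D) \subseteq \mathcal{R}(D)$ and
$\mathcal{R}_{out}(D) \supseteq \mathcal{R}(D)$.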

With the results presented in this way, the formal similarities between our pattern recognition problem and
the DSC problem are obvious. Additionally, ignoring the distortion constraints for the moment, the pattern
recognition problem can be thought of as a kind of generalization of the DSC problem, with the added
complication that the 'decoder' receives not one sequence $\mathbf{X}$ but $M_c = 2^{n R_c}$ such sequences, and must
first determine which is the appropriate one with which to jointly decode the second received sequence $\mathbf{Y}$.
This extra discrimination evidently requires extra information to be included at the encoders. This 'rate
excess' is the difference between the minimum encoding rates required for the DSC and pattern recognition
problems. Using the short-chain Markov constraints $U - X - Y$ and $X - Y - V$, the rate excess for
the $X$ encoder is

$$I(X;U) - I(X;U|V) = I(X;U) - I(XY;U|V)$$
$$= I(X;U) - [I(XY;UV) - I(XY;V)]$$
$$= I(X;U) + I(Y;V) - I(XY;UV)$$
$$= I(U;V) - I(U;V|XY)$$

and, by symmetry, at the $Y$ encoder the excess required rate is

$$I(Y;V) - I(Y;V|U) = I(U;V) - I(U;V|XY).$$

Thus, the excess rate required at either terminal is directly related to the maximum number of patterns
that must be discriminated, $M_c = 2^{n R_c}$, $R_c = I(U;V) - I(U;V|XY)$.
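
The last equality in the derivation of the rate excess compresses several chain-rule steps. One way to fill them in, using only the two short-chain constraints, is sketched below; this expansion is supplied here for the reader's convenience and is not taken from the paper's appendices.

```latex
\begin{align*}
I(XY;UV) &= I(XY;U) + I(XY;V|U)              && \text{(chain rule)} \\
         &= I(X;U) + I(XY;V|U)               && (U - X - Y) \\
I(XY;V|U) &= I(UXY;V) - I(U;V)               && \text{(chain rule)} \\
          &= I(XY;V) + I(U;V|XY) - I(U;V)    && \text{(chain rule)} \\
          &= I(Y;V) + I(U;V|XY) - I(U;V)     && (X - Y - V) \\
\Rightarrow\quad I(X;U) + I(Y;V) - I(XY;UV) &= I(U;V) - I(U;V|XY).
\end{align*}
```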

E. Extension of the inner bound 

The following results provide a way to reduce the gap between $\mathcal{R}_{in}$ and $\mathcal{R}_{out}$ 'from below,' by improving
on the inner bound.

Theorem 7.2: If the point $R = (R_c, R_x, R_y)$ is achievable, then for any $0 \leq \theta \leq 1$, the point $R' = \theta R$ is
achievable.

Corollary 7.2: Let

$$\mathcal{R}' = \{R' : R' = \theta R,\ R \in \mathcal{R}_{in},\ 0 \leq \theta \leq 1\}.$$

Then $\mathcal{R}' \subseteq \mathcal{R}$.

The theorem and corollary are proved in Appendix IV. As discussed in the next subsection, this extension
of the inner bound may in some cases allow us to close the gap, i.e. in cases where the expression for the
convex hull of $\mathcal{R}_{in}$ simplifies such that $\mathrm{Co}(\mathcal{R}_{in}) = \mathcal{R}'$. Specific examples where this appears to be the
case include the binary and Gaussian examples discussed in sections VIII and IX.

F. On closing the gap 

What additional results would be needed to determine the true achievable rate region $\mathcal{R}$? To explore this
question, consider the following hypothetical statements and their implications.

(a) $\mathcal{P}_{out} = \mathcal{P}_{mix}$

(b) $\mathrm{Co}(\mathcal{R}_{in}) = \mathcal{R}'$

(c) $\mathcal{R}$ is convex

(d) $\mathcal{R} = \mathcal{R}_{out}$

We emphasize that none of these statements has been proven. Nevertheless, the following Lemmas, stated
in 'if-then' form, are true.

Lemma 7.3: (a),(b) $\Rightarrow$ (d)

Lemma 7.4: (a),(c) $\Rightarrow$ (d)

The proof of Lemma 7.3 is as follows. Assuming $\mathrm{Co}(\mathcal{R}_{in}) = \mathcal{R}'$, then by corollary 7.2 the convex hull
is achievable, $\mathrm{Co}(\mathcal{R}_{in}) \subseteq \mathcal{R}$. But by (1), our assumption (a) implies $\mathcal{R}_{out} = \mathcal{R}_{mix} = \mathrm{Co}(\mathcal{R}_{in})$, hence
$\mathcal{R}_{out} \subseteq \mathcal{R}$. Combining this with theorem 6.3 we have $\mathcal{R}_{out} \subseteq \mathcal{R}$ and $\mathcal{R}_{out} \supseteq \mathcal{R}$, or $\mathcal{R} = \mathcal{R}_{out}$.

Lemma 7.4 follows from straightforward timesharing arguments, as shown in Appendix V.

Both Lemmas 7.3 and 7.4 suggest potential routes for establishing the true achievable rate region $\mathcal{R}$ by
expanding the inner bound $\mathcal{R}_{in}$. While we expect that premises (a) and (b) hold in certain cases, we suspect
that they are not true in general; we have no current guess about (c). On the other hand, if $\mathcal{R}_{out}$ is larger
than $\mathcal{R}_{mix}$, then it may still be possible to establish the true achievable rate region $\mathcal{R}$ by tightening the
outer bound, possibly down to $\mathcal{R}_{mix}$. Thus $\mathcal{R}_{out}$ and $\mathcal{R}_{mix}$ are presently the most promising candidates
for $\mathcal{R}$.

G. Degenerate cases 

We now briefly examine the degenerate cases where either $X = U$, or $Y = V$, or both. These simple cases
have clear interpretations and are thus useful for building intuition about the general results of theorems
6.2 and 6.3. Note that in these cases $I(U;V|XY) = 0$, hence the third inequality in the definition of $\mathcal{R}_{UV}$
simplifies to $R_c \leq I(U;V)$. Additionally, in these cases there is no gap, i.e. the inner and outer bounds
are equal; see Remark 6.4.

Unlimited senses and memory. First, consider a system in which the budgets for memory and sensory
representations are unrestricted, i.e. no compression is required. In this case, we can effectively treat the
memories and sensory representations as if they were veridical; i.e. we can set $U = X$ and $V = Y$. The
theorem constraints then become $R_x \geq I(X;X) = H(X)$, $R_y \geq I(Y;Y) = H(Y)$, and

$$R_c \leq I(U;V) = I(X;Y). \quad (2)$$

This result indicates that, in the absence of compression, the recognition problem is formally equivalent to
the following classical communication problem: Transmit one of $M_c = 2^{n R_c}$ possible messages (patterns)
to a receiver (the recognition module) [28]. In this case, the objects can be thought of as codewords which
are stored without compression for direct comparison with the sensory data. This is the setup of the random
coding proof of Shannon's channel coding theorem, which gives the rates at which reliable communication
is possible as those below the mutual information between the source (analogous to the memory here)
and the received signals, $I(X;Y)$ [?], [?]. This is exactly the condition expressed by (2). The condition
specifies an upper bound on the number of objects the system may be trained to recognize through the
relation $M_c = 2^{n R_c}$.

Unlimited memory, limited senses. Next, suppose that memory is effectively unlimited, so that we can put $U = X$, but sensory data may be compressed. In this case, we can readily rewrite the condition on $R_c$ as

$$R_c \le I(X;Y) - I(X;Y|V). \qquad (3)$$

We check the extreme cases: If $V$ is fully informative about $Y$, i.e. $Y$ is a deterministic function of $V$, then $I(X;Y|V) = H(Y|V) - H(Y|X,V) = 0$, and we recover the case discussed above. For intermediate cases where $V$ is partially informative, the effect of $V$ is to degrade the achievable performance of the system below that possible with 'perfect senses,' and the reduction incurred is $I(X;Y|V)$. In the extreme case that $V$ is utterly uninformative (i.e. independent of $Y$), $I(X;Y|V) = I(X;Y)$, and we get $R_c = 0$, or $M_c \le 2^{nR_c} = 1$, hence the system is useless.

Limited memory, unlimited senses. In the case of limited memory but unrestricted resources for sensory data representation, we get an expression symmetric with the previous case:

$$R_c \le I(X;Y) - I(X;Y|U). \qquad (4)$$

As before, if the memory is perfect ($U = X$), we get $I(X;Y|U) = I(X;Y|X) = 0$, recovering the channel coding constraint $R_c \le I(X;Y)$; assuming useless memories yields $R_c \le I(X;Y) - I(X;Y) = 0$; and intermediate cases place the system between these extremes.
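The rewriting behind (3) and (4) is not spelled out in the text. One way to obtain it, offered here as a sketch that assumes only the chain rule for mutual information and the Markov chains $X - Y - V$ and $U - X - Y$, is:

```latex
% Case U = X (unlimited memory): the constraint on R_c involves I(U;V) = I(X;V).
\begin{align*}
I(X;V) &= I(X;Y,V) - I(X;Y \mid V) \\
       &= I(X;Y) + I(X;V \mid Y) - I(X;Y \mid V) \\
       &= I(X;Y) - I(X;Y \mid V),
\end{align*}
% since I(X;V|Y) = 0 when X - Y - V is a Markov chain.

% Case V = Y (unlimited senses): the constraint involves I(U;V) = I(U;Y).
\begin{align*}
I(U;Y) &= I(X,U;Y) - I(X;Y \mid U) \\
       &= I(X;Y) + I(U;Y \mid X) - I(X;Y \mid U) \\
       &= I(X;Y) - I(X;Y \mid U),
\end{align*}
% since I(U;Y|X) = 0 when U - X - Y is a Markov chain.
```

In both cases the penalty term is the information about the pattern that the uncompressed variable still carries once the compressed one is known.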

H. Rate region surfaces 

An equivalent way to characterize the sets $\mathcal{R}$, $\mathcal{R}_{in}$ and $\mathcal{R}_{out}$ that will be useful in Sections VIII and IX is to specify the boundary or surface of each region. For $\mathcal{R}$, the surface is

$$r(r_x, r_y) = \max R_c,$$

where the maximum is over the set $\{R : R \in \mathcal{R},\ R_x = r_x,\ R_y = r_y\}$.

Similarly, by direct extension of Theorems 6.2 and 6.3, the surfaces of $\mathcal{R}_{in}$ and $\mathcal{R}_{out}$ are specified by

$$r_{in}(r_x, r_y) = \max_{UV \in \mathcal{C}_{in}(r_x, r_y)} I(U;V) - I(U;V|XY) \qquad (5)$$

$$r_{out}(r_x, r_y) = \max_{UV \in \mathcal{C}_{out}(r_x, r_y)} I(U;V) - I(U;V|XY),$$

where

$$\mathcal{C}_{in}(r_x, r_y) = \{UV \in \mathcal{P}_{in} : r_x \ge I(U;X),\ r_y \ge I(V;Y)\}$$

$$\mathcal{C}_{out}(r_x, r_y) = \{UV \in \mathcal{P}_{out} : r_x \ge I(U;X),\ r_y \ge I(V;Y)\}.$$

A useful alternative form comes from rewriting the right hand side of (5) as

$$\begin{aligned}
I(U;V) - I(U;V|XY) &= I(U;V) - H(U|XY) - H(V|XY) + H(UV|XY) \\
&= I(U;V) - H(U|X) - H(V|Y) + H(UV|XY) \\
&= I(X;U) + I(Y;V) - I(XY;UV). \qquad (6)
\end{aligned}$$

The second equality follows from the Markov constraints $U - X - Y$ and $X - Y - V$. Hence,

$$r_{*}(r_x, r_y) = \max\ I(X;U) + I(Y;V) - I(XY;UV), \qquad (7)$$

where the subscript $*$ stands for in or out, and the maximization is over $\mathcal{C}_{in}(r_x, r_y)$ or $\mathcal{C}_{out}(r_x, r_y)$, respectively.

In what follows we seek explicit formulas for $r_{in}(r_x, r_y)$ and $r_{out}(r_x, r_y)$, which do not involve the optimization over the sets $\mathcal{C}_{in}(r_x, r_y)$ and $\mathcal{C}_{out}(r_x, r_y)$.



VIII. Binary case

In this section we study a simple case in which the alphabets for the training patterns and sensory data are binary, $\mathcal{X} = \mathcal{Y} = \{0, 1\}$. The training patterns $\mathbf{X} = (X_1, \ldots, X_n)$ are generated by $n$ independent drawings from a uniform Bernoulli distribution, $X \sim B(1/2)$. Observations $\mathbf{Y} = (Y_1, \ldots, Y_n)$ are outputs of a binary symmetric channel with crossover probability $q$,

$$p(y|x) = \begin{cases} \bar{q}, & y = x \\ q, & y \ne x, \end{cases}$$

where $\bar{q} = 1 - q$. Equivalently, we can represent $Y$ as $Y = X \oplus W$, where $W \sim B(q)$ and is independent of $X$.

We now propose explicit formulas for $r_{in}(r_x, r_y)$ and $r_{out}(r_x, r_y)$ in this binary case. Our formulas involve the following two functions. First, define

$$g(r_x, r_y) = 1 - h(q * q_x * q_y),$$

where $q_x$ and $q_y$ are specified implicitly by

$$r_x = 1 - h(q_x)$$
$$r_y = 1 - h(q_y);$$

$h(\cdot)$ is the binary entropy function $h(x) = -x\log(x) - (1 - x)\log(1 - x)$; and $q_x, q_y \in [0, 1/2]$ to ensure that $h(\cdot)$ is invertible. Next, let $g^*(r_x, r_y)$ denote the upper concave envelope of $g(r_x, r_y)$,

$$g^*(r_x, r_y) = \sup\ \theta\, g(r_{x_1}, r_{y_1}) + \bar{\theta}\, g(r_{x_2}, r_{y_2}),$$

where $\bar{\theta} = 1 - \theta$. The supremum is over all combinations $(\theta, r_{x_1}, r_{y_1}, r_{x_2}, r_{y_2})$ such that

$$(r_x, r_y) = \theta (r_{x_1}, r_{y_1}) + \bar{\theta} (r_{x_2}, r_{y_2}),$$

and each variable in the optimization is restricted to the unit interval $[0, 1]$. As explained in Appendix VII, in both the binary case and the corresponding Gaussian case considered in the next section, the expression for the convex hull of the inner bound simplifies to

$$g^*(r_x, r_y) = \sup\ \theta\, g(r'_x, r'_y),$$

and the supremum is over all combinations $(\theta, r'_x, r'_y)$ such that

$$(r_x, r_y) = \theta (r'_x, r'_y).$$

Conjecture 8.1: In the binary case the surfaces of $\mathcal{R}_{in}$ and $\mathcal{R}_{out}$ are

$$r_{in}(r_x, r_y) = g(r_x, r_y)$$
$$r_{out}(r_x, r_y) = g^*(r_x, r_y).$$

Fig. 3. Surfaces of the binary inner bound $z = r_{in}$ (a) and outer bound $z = r_{out}$ (b) regions; and difference between the outer bound and inner bounds $z = r_{in} - r_{out}$ (c). In these plots the crossover probability $q = 0.2$.
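The functions $g$ and $g^*$ are easy to evaluate numerically. The following is a minimal sketch, assuming logarithms to base 2 (rates in bits); the helper names `h2`, `h2_inv`, `conv`, `g` and `g_hull` are illustrative and not from the paper. The envelope uses the simplified form $\sup_\theta \theta\, g(r_x/\theta, r_y/\theta)$ quoted above and approximates the supremum on a grid.

```python
import math


def h2(p):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def h2_inv(y, tol=1e-12):
    """Inverse of h2 restricted to [0, 1/2], computed by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h2(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def conv(a, b):
    """Binary convolution a * b = a(1 - b) + (1 - a)b."""
    return a * (1.0 - b) + (1.0 - a) * b


def g(r_x, r_y, q):
    """g(r_x, r_y) = 1 - h(q * q_x * q_y), with r_x = 1 - h(q_x), r_y = 1 - h(q_y)."""
    q_x = h2_inv(1.0 - r_x)
    q_y = h2_inv(1.0 - r_y)
    return 1.0 - h2(conv(conv(q, q_x), q_y))


def g_hull(r_x, r_y, q, steps=2000):
    """Grid approximation of sup over theta of theta * g(r_x / theta, r_y / theta)."""
    theta_min = max(r_x, r_y)  # keeps r_x / theta and r_y / theta inside [0, 1]
    if theta_min <= 0.0:
        return 0.0
    best = 0.0
    for k in range(steps + 1):
        theta = theta_min + (1.0 - theta_min) * k / steps
        best = max(best, theta * g(r_x / theta, r_y / theta, q))
    return best
```

For example, with $q = 0.2$ the call `g(1.0, 1.0, 0.2)` recovers the uncompressed value $1 - h(0.2)$, while at small rates such as `(0.1, 0.1)` the envelope `g_hull` is noticeably larger than `g`, which is the gap that timesharing closes.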

From Theorem 7.2, $g^*(r_x, r_y)$ is in fact achievable. Thus, if the conjecture on the outer bound is true, then there is no gap between the inner and outer bounds, and $g^*(r_x, r_y)$ defines the achievable rate region. Figure 3 shows the inner and outer bounds and their difference.

To establish these conjectures we must prove both the 'forward' inequalities $r_{in} \ge g$, $r_{out} \ge g^*$, and the 'backward' inequalities $r_{in} \le g$, $r_{out} \le g^*$. The backward inequalities remain to be proven, whereas the forward inequalities can be proven by relatively straightforward constructions, as we now show.

Proof: ($r_{in}(r_x, r_y) \ge g(r_x, r_y)$) Let $W_x \sim B(q_x)$, $W_y \sim B(q_y)$ be binary random variables independent of $X$ and $Y$, and define

$$U = X \oplus W_x$$
$$V = Y \oplus W_y.$$

The pair $UV$ is obviously in $\mathcal{P}_{in}$. Furthermore,

$$\begin{aligned}
I(X;U) &= H(X) - H(X|U) \\
&= 1 - H(U \oplus W_x | U) \\
&= 1 - H(W_x) \\
&= 1 - h(q_x),
\end{aligned}$$

$$\begin{aligned}
I(V;Y) &= H(Y) - H(Y|V) \\
&= 1 - H(V \oplus W_y | V) \\
&= 1 - H(W_y | V) \\
&= 1 - h(q_y),
\end{aligned}$$

$$\begin{aligned}
I(U;V) &= H(V) - H(V|U) \\
&= 1 - H(U \oplus W_x \oplus W \oplus W_y | U) \\
&= 1 - h(q_x * q * q_y).
\end{aligned}$$

Setting $r_x = I(U;X) = 1 - h(q_x)$, and $r_y = I(Y;V) = 1 - h(q_y)$, we have $UV \in \mathcal{P}_{in}$ and $UV \in \mathcal{C}_{in}(r_x, r_y)$. Hence,

$$r_{in}(r_x, r_y) = \max I(U;V) \ge 1 - h(q * q_x * q_y) = g(r_x, r_y).$$

■
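The three mutual informations in this construction can be checked by brute force on the joint distribution of $(X, Y, U, V)$. The sketch below assumes base-2 logarithms and uses illustrative helper names (`joint_pmf`, `entropy`, `mi`) that are not from the paper. It also checks that $I(U;V|XY) = 0$ for this pair and that identity (6) holds.

```python
import itertools
import math


def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def conv(a, b):
    """Binary convolution a * b = a(1 - b) + (1 - a)b."""
    return a * (1.0 - b) + (1.0 - a) * b


def joint_pmf(q, q_x, q_y):
    """pmf of (X, Y, U, V): X ~ B(1/2), Y = X^W, U = X^W_x, V = Y^W_y."""
    pmf = {}
    for x, w, wx, wy in itertools.product((0, 1), repeat=4):
        p = 0.5
        p *= q if w else 1.0 - q
        p *= q_x if wx else 1.0 - q_x
        p *= q_y if wy else 1.0 - q_y
        y = x ^ w
        key = (x, y, x ^ wx, y ^ wy)
        pmf[key] = pmf.get(key, 0.0) + p
    return pmf


def entropy(pmf, idx):
    """Entropy in bits of the marginal on the coordinates listed in idx."""
    marg = {}
    for key, p in pmf.items():
        sub = tuple(key[i] for i in idx)
        marg[sub] = marg.get(sub, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0.0)


def mi(pmf, a, b, c=()):
    """Conditional mutual information I(A;B|C) in bits."""
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))


# Coordinate positions of X, Y, U, V in each key of the pmf.
X, Y, U, V = (0,), (1,), (2,), (3,)
q, q_x, q_y = 0.2, 0.1, 0.3
pmf = joint_pmf(q, q_x, q_y)

i_xu = mi(pmf, X, U)
i_yv = mi(pmf, Y, V)
i_uv = mi(pmf, U, V)
i_uv_given_xy = mi(pmf, U, V, X + Y)
i_xy_uv = mi(pmf, X + Y, U + V)
```

With these parameters `i_xu`, `i_yv` and `i_uv` agree with $1 - h(q_x)$, $1 - h(q_y)$ and $1 - h(q_x * q * q_y)$ to numerical precision, and `i_xu + i_yv - i_xy_uv` equals `i_uv - i_uv_given_xy`.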

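Proof: ($r_{out}(r_x, r_y) \ge g^*(r_x, r_y)$) Using the same construction as in the forward proof for the inner bound formula, define two pairs of random variables $(U_1 V_1), (U_2 V_2) \in \mathcal{P}_{in} \subseteq \mathcal{P}_{out}$ such that

$$r_{x_1} = I(U_1;X) = 1 - h(q_{x_1}),$$
$$r_{y_1} = I(V_1;Y) = 1 - h(q_{y_1}),$$
$$r_{x_2} = I(U_2;X) = 1 - h(q_{x_2}),$$
$$r_{y_2} = I(V_2;Y) = 1 - h(q_{y_2}).$$

Let $(r_x, r_y) = \theta (r_{x_1}, r_{y_1}) + \bar{\theta} (r_{x_2}, r_{y_2})$, $\theta \in [0, 1]$. Since the region $\mathcal{R}_{out}$ is convex (Appendix III), its surface $r_{out}(r_x, r_y)$ is concave, and we have

$$\begin{aligned}
r_{out}(r_x, r_y) &\ge \theta\, r_{out}(r_{x_1}, r_{y_1}) + \bar{\theta}\, r_{out}(r_{x_2}, r_{y_2}) \\
&\ge \theta\, g(r_{x_1}, r_{y_1}) + \bar{\theta}\, g(r_{x_2}, r_{y_2}).
\end{aligned}$$

The inequalities above hold for all valid choices of $\theta, r_{x_1}, r_{x_2}, r_{y_1}, r_{y_2}$, hence $r_{out}(r_x, r_y) \ge g^*(r_x, r_y)$, as desired. ■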

IX. Gaussian case 

We now consider a Gaussian version of our problem. Let $X$ and $Y$ be zero-mean Gaussian random variables with correlation coefficient $\rho_{xy}$. In parallel with our discussion of the binary case, we propose explicit formulas for the surfaces of $\mathcal{R}_{in}$ and $\mathcal{R}_{out}$ for the Gaussian case, this time in terms of the following two functions. In both formulas, let

$$r_x = -\tfrac{1}{2}\log(1 - \rho_{xu}^2)$$
$$r_y = -\tfrac{1}{2}\log(1 - \rho_{yv}^2).$$

Note that these expressions determine the correlation coefficients $\rho_{xu}$ and $\rho_{yv}$. Define

$$G(r_x, r_y) = -\tfrac{1}{2}\log(1 - \rho_{xy}^2 \rho_{yv}^2 \rho_{xu}^2) \qquad (8)$$

and

$$G^*(r_x, r_y) = r_x + r_y + \tfrac{1}{2}\log\left[1 + \frac{2\rho\gamma - \beta}{1 - \rho^2}\right], \qquad (9)$$

where

$$\begin{aligned}
\gamma &= \rho_{xy}\rho_{xu}\rho_{yv}, \\
\beta &= \rho_{xu}^2 + \rho_{yv}^2 - (1 - \rho_{xy}^2)\rho_{xu}^2\rho_{yv}^2, \\
\rho &= \frac{\beta}{2\gamma} - \sqrt{\left(\frac{\beta}{2\gamma}\right)^2 - 1}.
\end{aligned} \qquad (10)$$

Conjecture 9.1: In the Gaussian case the surfaces of $\mathcal{R}_{in}$ and $\mathcal{R}_{out}$ are

$$r_{in}(r_x, r_y) = G(r_x, r_y)$$
$$r_{out}(r_x, r_y) = G^*(r_x, r_y).$$

Figure 4 shows plots of the inner and outer bounds and their difference, as well as the difference between the outer bound and the convex hull of the inner bound. Interestingly, unlike the binary case, for the Gaussian case the outer bound is not equal to the convex hull of the inner bound.

The following proof relies on some basic properties of the mutual information between Gaussian random variables, given as Lemmas in Appendix VIII.

Fig. 4. Surfaces of the Gaussian inner bound $z = r_{in}$ (a) and outer bound $z = r_{out}$ (b) regions; and differences between the outer bound and inner bounds $z = r_{in} - r_{out}$ (c) and between the outer bound and the convex hull of the inner bound $z = r_{out} - H(r_{in})$ (d). In these plots $\rho_{xy} = 0.8$.

In the analysis that follows, we assume that the true distributions are Gaussian. Under this assumption, we solve for the inner and outer bounds. If the true distributions are Gaussian, then our conjecture is true.
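The Gaussian quantities can be explored numerically. The sketch below rests on several assumptions: logarithms are base 2, all correlation coefficients are strictly between 0 and 1, the determinant expression $I(XY;UV) = -\tfrac{1}{2}\log|C| - \tfrac{1}{2}\log|C^{-1} - D C_{uv}^{-1} D|$ from the outer-bound proof is used with $C$ the covariance of $(X, Y)$ and $D = \mathrm{diag}(\rho_{xu}, \rho_{yv})$, and the optimizing $\rho$ is taken as the smaller root of the stationarity condition $\gamma\rho^2 - \beta\rho + \gamma = 0$. Function names are illustrative, and the code does not check that the resulting covariance matrix is positive semidefinite.

```python
import math


def gamma_beta(rho_xy, rho_xu, rho_yv):
    """The quantities gamma and beta that appear in G*."""
    gamma = rho_xy * rho_xu * rho_yv
    beta = rho_xu**2 + rho_yv**2 - (1.0 - rho_xy**2) * rho_xu**2 * rho_yv**2
    return gamma, beta


def rate(rho):
    """r = -1/2 log2(1 - rho^2)."""
    return -0.5 * math.log2(1.0 - rho**2)


def mi_closed(rho_uv, gamma, beta):
    """Closed form: I(XY;UV) = -1/2 log2[1 + (2 rho_uv gamma - beta)/(1 - rho_uv^2)]."""
    return -0.5 * math.log2(1.0 + (2.0 * rho_uv * gamma - beta) / (1.0 - rho_uv**2))


def mi_det(rho_xy, rho_xu, rho_yv, rho_uv):
    """Same quantity from the 2x2 determinants |C| and |C^-1 - D C_uv^-1 D|."""
    a = 1.0 - rho_xy**2  # |C|
    b = 1.0 - rho_uv**2  # |C_uv|
    m11 = 1.0 / a - rho_xu**2 / b
    m22 = 1.0 / a - rho_yv**2 / b
    m12 = -rho_xy / a + rho_uv * rho_xu * rho_yv / b
    return -0.5 * math.log2(a) - 0.5 * math.log2(m11 * m22 - m12**2)


def G(rho_xy, rho_xu, rho_yv):
    """Inner-bound surface: -1/2 log2(1 - gamma^2)."""
    gamma, _ = gamma_beta(rho_xy, rho_xu, rho_yv)
    return -0.5 * math.log2(1.0 - gamma**2)


def rho_star(gamma, beta):
    """Smaller root of gamma rho^2 - beta rho + gamma = 0 (needs beta >= 2 gamma > 0)."""
    t = beta / (2.0 * gamma)
    return t - math.sqrt(t * t - 1.0)


def G_star(rho_xy, rho_xu, rho_yv):
    """Outer-bound surface: r_x + r_y - min over rho_uv of I(XY;UV)."""
    gamma, beta = gamma_beta(rho_xy, rho_xu, rho_yv)
    return rate(rho_xu) + rate(rho_yv) - mi_closed(rho_star(gamma, beta), gamma, beta)
```

Setting $\rho_{uv} = \gamma$, the value forced by the long Markov chain $U - X - Y - V$, in `rate(rho_xu) + rate(rho_yv) - mi_closed(...)` reproduces `G`, which makes the relation between the two surfaces explicit: the outer bound simply optimizes over a correlation that the inner bound fixes.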

Proof: ($r_{in}(r_x, r_y) = G(r_x, r_y)$) As noted in Appendix VIII, mutual informations between jointly Gaussian random variables are completely determined by their correlation coefficients. For a length-4 Markov chain $U - X - Y - V$ of jointly Gaussian random variables, $I(U;V|XY) = 0$ and, applying the corresponding Lemma from Appendix VIII, we have $\rho_{uv} = \rho_{xu}\rho_{xy}\rho_{yv}$, hence

$$I(U;V) - I(U;V|XY) = -\tfrac{1}{2}\log(1 - \rho_{xu}^2\rho_{xy}^2\rho_{yv}^2).$$

This mutual information is maximized when the constraints $I(X;U) \le r_x$, $I(Y;V) \le r_y$ are satisfied with equality, hence when $\rho_{xu}$ and $\rho_{yv}$ satisfy $r_x = -\tfrac{1}{2}\log(1 - \rho_{xu}^2)$ and $r_y = -\tfrac{1}{2}\log(1 - \rho_{yv}^2)$. This proves the theorem. ■

The following proof for the surface of the outer bound region uses the form of $r_{out}(r_x, r_y)$ in (7). We assume that the constraints on $r_x$ and $r_y$ are satisfied with equality. In this case, the optimization problem reduces to that of minimizing $I(XY;UV)$ subject to the length-3 Markov constraints $U - X - Y$, $X - Y - V$.

Proof: ($r_{out}(r_x, r_y) = G^*(r_x, r_y)$) Using the corresponding Lemma from Appendix VIII we have

$$C_{xy,uv} = \begin{pmatrix} \rho_{xu} & \rho_{xv} \\ \rho_{yu} & \rho_{yv} \end{pmatrix} = \begin{pmatrix} 1 & \rho_{xy} \\ \rho_{xy} & 1 \end{pmatrix} \begin{pmatrix} \rho_{xu} & 0 \\ 0 & \rho_{yv} \end{pmatrix}.$$

The left hand matrix in this decomposition is $C_{xy,xy}$, denoted hereafter simply as $C$, and we denote the righthand matrix by $D$. Then applying Lemma 8.1 from Appendix VIII yields

$$\begin{aligned}
I(XY;UV) &= \tfrac{1}{2}\log|C| - \tfrac{1}{2}\log|C - C_{xy,uv} C_{uv,uv}^{-1} C_{uv,xy}| \\
&= \tfrac{1}{2}\log|C| - \tfrac{1}{2}\log|C - C D C_{uv,uv}^{-1} D C| \\
&= -\tfrac{1}{2}\log|C| - \tfrac{1}{2}\log|C^{-1} - D C_{uv,uv}^{-1} D|.
\end{aligned}$$

Substituting for the $2 \times 2$ matrices in this last expression and rearranging terms yields

$$I(XY;UV) = -\tfrac{1}{2}\log\left[1 + \frac{2\rho_{uv}\gamma - \beta}{1 - \rho_{uv}^2}\right],$$

where $\gamma$ and $\beta$ are defined in (10).

By assumption, $\rho_{xu}$ and $\rho_{yv}$ are being held fixed, so we are optimizing $I(XY;UV)$ only with respect to $\rho_{uv}$. Setting $\partial I(XY;UV)/\partial\rho_{uv} = 0$ and solving, we obtain that, if $\beta \ge 2\gamma \ge 0$, then the maximum is achieved at $\rho_{uv}^* = \rho$, where $\rho$ is defined in (10).

To complete the proof we must show that $\beta \ge 2\gamma \ge 0$. Noting that $\beta, \gamma \ge 0$ and substituting, the desired inequality becomes

$$\rho_{xu}^2 + \rho_{yv}^2 - \rho_{xu}^2\rho_{yv}^2 \ge 2\rho_{xy}\rho_{xu}\rho_{yv} - \rho_{xy}^2\rho_{xu}^2\rho_{yv}^2.$$

Subtracting 1 from each side and factoring yields the equivalent inequality

$$-(1 - \rho_{xu}^2)(1 - \rho_{yv}^2) \ge -(1 - \rho_{xy}\rho_{xu}\rho_{yv})^2.$$

To show that this holds for all $\rho_{xy}$, note that the maximum of the right hand side is achieved by $\rho_{xy} = 1$, so that the inequality becomes

$$(1 - \rho_{xu}^2)(1 - \rho_{yv}^2) - (1 - \rho_{xu}\rho_{yv})^2 \le 0.$$

This inequality holds, since

$$\begin{aligned}
(1 - \rho_{xu}^2)(1 - \rho_{yv}^2) - (1 - \rho_{xu}\rho_{yv})^2 &= 1 - \rho_{yv}^2 - \rho_{xu}^2 + \rho_{xu}^2\rho_{yv}^2 - [1 - 2\rho_{xu}\rho_{yv} + \rho_{xu}^2\rho_{yv}^2] \\
&= -\rho_{xu}^2 - \rho_{yv}^2 + 2\rho_{xu}\rho_{yv} \\
&= (\rho_{xu} - \rho_{yv})(\rho_{yv} - \rho_{xu}) \\
&= -(\rho_{xu} - \rho_{yv})^2 \\
&\le 0.
\end{aligned}$$



Appendix I 
Proof of the inner bound 

In this section we prove the inner bound $\mathcal{R}_{in} \subseteq \mathcal{R}$, Theorem 6.2. The proof relies on standard random coding arguments and properties of strongly jointly typical sets [?]. Given a joint distribution $p(xyuv)$, the strongly jointly $\delta$-typical set is defined by

$$T^{\delta}_{UVXY} = \left\{ \mathbf{xyuv} : \left| \tfrac{1}{n} N(xyuv|\mathbf{xyuv}) - p(xyuv) \right| < \delta \quad \forall\, xyuv \in \mathcal{X}\mathcal{Y}\mathcal{U}\mathcal{V} \right\},$$

where $N(xyuv|\mathbf{xyuv})$ is the number of times the symbol combination $xyuv$ occurs in $\mathbf{xyuv}$. Likewise, we write e.g. $T^{\delta}_{X}$, $T^{\delta}_{XY}$, $T^{\delta}_{XYU}$ for singles, pairs, and triples. We will also use conditionally strongly jointly $\delta$-typical sets, for example

$$T^{\delta}_{U|\mathbf{x}} = \{\mathbf{u} : (\mathbf{x}\mathbf{u}) \in T^{\delta}_{XU}\}.$$

The subscripts are omitted when context allows. We will also need the fact that for any positive numbers $\delta, \epsilon > 0$, fixed vector $\mathbf{x}$, and large enough $n$,

$$2^{-n[I(X;Y)+\epsilon]} \le \Pr(\mathbf{x}\mathbf{Y} \in T^{\delta}_{XY}) \le 2^{-n[I(X;Y)-\epsilon]}. \qquad (11)$$
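The membership test for a strongly jointly $\delta$-typical set is a direct comparison of empirical frequencies with the pmf. A minimal sketch follows; the function name and the representation of the pmf as a dictionary over all symbol combinations are assumptions of this example, and the normalization of counts by $n$ is as in the definition above.

```python
from collections import Counter


def is_strongly_typical(seqs, pmf, delta):
    """Strong joint typicality test.

    seqs  : tuple of equal-length sequences, e.g. (x, y)
    pmf   : dict mapping every symbol combination, e.g. (0, 1), to its probability
    delta : tolerance on each empirical frequency
    """
    n = len(seqs[0])
    counts = Counter(zip(*seqs))
    # A combination that the pmf does not list at all is treated as atypical.
    if any(sym not in pmf for sym in counts):
        return False
    # Counter returns 0 for combinations that never occur in the sequences.
    return all(abs(counts[sym] / n - p) < delta for sym, p in pmf.items())


# Toy example: X ~ B(1/2) observed through a binary symmetric channel with q = 0.25.
pmf_xy = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}
x = (0, 0, 0, 0, 1, 1, 1, 1)
y_good = (0, 0, 0, 1, 1, 1, 1, 0)   # two flips, one in each half
y_bad = (1, 1, 1, 1, 0, 0, 0, 0)    # every symbol flipped

typical_good = is_strongly_typical((x, y_good), pmf_xy, delta=0.05)
typical_bad = is_strongly_typical((x, y_bad), pmf_xy, delta=0.05)
```

In this toy case the pair `(x, y_good)` matches the pmf exactly and is typical, whereas `(x, y_bad)` is not.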

Proof: To begin, let $R = (R_c, R_x, R_y)$ be any rate triple in $\mathcal{R}_{in}$, and let $\epsilon > 0$ be any positive constant. Then there exists a pair of random variables $UV \in \mathcal{P}_{in}$ such that $R \in \mathcal{R}_{UV}$. We wish to prove $R \in \mathcal{R}$. To this end, we will use $UV$ to construct an $(M_x, M_y, M_c, n)$ pattern recognition code $(f, \phi, g)$, with $M_c \ge 2^{nR_c}$, $M_x \le 2^{nR_x}$, and $M_y \le 2^{nR_y}$, such that $P_e^n < \epsilon$ for a sufficiently large integer $n$.

For concreteness, we will suppose that the mappings $f$, $\phi$ and $g$ are implemented in distinct memory, sensory, and recognition 'modules,' respectively, each of which 'knows' the joint distribution $p(xyuv)$.

Random codebook generation. To serve as codewords, select $M_x$ length-$n$ vectors by sampling with replacement from a uniform distribution over the set $T^{\delta}_U$. Assign each codeword a unique index $i \in \mathcal{M}_x$, where $\mathcal{M}_x = \{1, 2, \ldots, M_x\}$. Denote the resulting codebook

$$\mathcal{B}_u = \{\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(M_x)\},$$

where the $\mathbf{u}(i)$ are the indexed codewords.

Similarly, for the sensory module generate $M_y$ length-$n$ codewords by sampling with replacement from a uniform distribution on $T^{\delta}_V$. Assign each codeword a unique index $j \in \mathcal{M}_y$, where $\mathcal{M}_y = \{1, 2, \ldots, M_y\}$. Denote the resulting codebook

$$\mathcal{B}_v = \{\mathbf{v}(1), \mathbf{v}(2), \ldots, \mathbf{v}(M_y)\},$$

where the $\mathbf{v}(j)$ are the indexed codewords.

Provide copies of both codebooks $\mathcal{B}_u$ and $\mathcal{B}_v$ to the recognition module.

Memory encoding rule $f$. Let $\mathcal{C}_x = \{(\mathbf{X}(1), 1), (\mathbf{X}(2), 2), \ldots, (\mathbf{X}(M_c), M_c)\}$ be the set of labeled random patterns to be encoded into memory during the training phase. We define the memory encoder $f$ in terms of the following procedure. Given a labeled pattern $(\mathbf{x}(w), w)$, the memory module searches through the memory codebook $\mathcal{B}_u$ for a codeword $\mathbf{u}$ such that $(\mathbf{x}(w), \mathbf{u}) \in T^{\delta}_{XU}$. If such a codeword is found we denote it by $\mathbf{u}(w)$, and denote its index in the codebook $\mathcal{B}_u$ by $m(w)$. If $\mathcal{B}_u$ has no codeword that is strongly jointly $\delta$-typical with $\mathbf{x}(w)$, an error is declared and the label $w$ is associated with the first codeword of $\mathcal{B}_u$. Denoting the event that the above procedure fails by $E_1$ and its complement by $E_1^c$, let

$$f(\mathbf{x}(w), w) = \begin{cases} (1, w), & \text{if } E_1 \text{ occurs;} \\ (m(w), w), & \text{if } E_1^c \text{ occurs.} \end{cases}$$

An error is also declared if the above procedure results in assigning more than one pattern label to the same memory codeword; denote this second error event $E_2$. The training phase corresponds formally to applying $f$ to all $M_c$ patterns in $\mathcal{C}_x$, inducing the set

$$\mathcal{C}_u = f(\mathcal{C}_x) = \{(m(1), 1), \ldots, (m(M_c), M_c)\}.$$

Note that not all of the codewords in $\mathcal{B}_u$ have been used in the encoding procedure. Likewise, in the decoding algorithm described below, we need only consider the subset of codewords $\mathbf{u} \in \mathcal{B}_u$ whose indices in $\mathcal{B}_u$ also appear in $\mathcal{C}_u$. We denote the set of indices for these 'active' codewords $\mathcal{L} = \mathcal{L}(\mathcal{C}_u) = \{m(1), m(2), \ldots, m(M_c)\}$.

After training, reveal the memory codebook $\mathcal{B}_u$, the compressed data $\mathcal{C}_u$, and the mapping $f$ to the recognition module.

Sensory encoding rule $\phi$. The sensory encoding rule $\phi$ is defined as follows. Let $\mathbf{y}$ be an input to the sensory module during the testing phase. The sensory module searches sequentially through the sensory codebook $\mathcal{B}_v$ for a codeword $\mathbf{v}$ such that $(\mathbf{y}, \mathbf{v}) \in T^{\delta}_{YV}$. If the search succeeds, denote the found codeword by $\mathbf{v}(\mathbf{y})$ and denote its index by $\mu(\mathbf{y})$. If the search fails, declare an error, and let the sensory encoder output be $\mu = 1$. Letting $E_3$ be the error event and $E_3^c$ its complement, let

$$\phi(\mathbf{y}) = \begin{cases} 1, & \text{if } E_3 \text{ occurs;} \\ \mu(\mathbf{y}), & \text{if } E_3^c \text{ occurs.} \end{cases}$$

Reveal the sensory codebook $\mathcal{B}_v$ and the mapping $\phi$ to the recognition module.

Classifier: $g_1$. We next specify $g_1$, the first part of the classifier $g = g_1 \circ g_2$. Upon receiving the index $\mu = \mu(\mathbf{y})$ from the sensory module, the recognition module retrieves the $\mu$-th codeword $\mathbf{v}(\mathbf{y})$ from the sensory codebook $\mathcal{B}_v$, then searches the 'active' portion of the memory codebook $\mathcal{B}_u(\mathcal{L}) \subseteq \mathcal{B}_u$ for a codeword $\mathbf{u}$ such that $(\mathbf{u}, \mathbf{v}(\mathbf{y})) \in T^{\delta}_{UV}$. If such a $\mathbf{u}$ exists and is unique, denote it by $\hat{\mathbf{u}} = \mathbf{u}(\mu)$ and its index in the codebook $\mathcal{B}_u$ by $\hat{m} = m(\hat{\mathbf{u}})$. If no such $\mathbf{u}$ exists, declare an error, $E_4$; if more than one such $\mathbf{u}$ exists, declare an error $E_5$; and in case of either $E_4$ or $E_5$ let $\hat{m} = 1$. Thus, set

$$g_1(\mu) = \begin{cases} 1, & \text{if either } E_4 \text{ or } E_5 \text{ or both occur;} \\ \hat{m}, & \text{if both } E_4^c \text{ and } E_5^c \text{ occur.} \end{cases}$$

Classifier: $g_2$. After determining the codeword index $\hat{m} = g_1(\mu)$, the recognition module searches the set of stored data $\mathcal{C}_u$ for a pair $(m, w)$ whose first entry is $m = \hat{m}$ and retrieves the associated class label. Note that if none of the errors $E_i$, $i = 1, \ldots, 5$ occurs, the pair $(m, w)$ is in fact unique. If there is more than one such pair, then to ensure uniqueness choose the first. Denoting the retrieved label by $\hat{w}$, let

$$g_2(\hat{m}, \mathcal{C}_u) = \hat{w}.$$
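The data flow through $f$, $\phi$, $g_1$ and $g_2$ can be sketched in a few lines. This is only an illustration of the lookup structure, under loud assumptions: the codebooks are hand-picked rather than sampled from typical sets, and the joint-typicality test is replaced by a crude Hamming-distance predicate (`hamming_close`, a hypothetical stand-in). Indices are 1-based to match the text.

```python
def hamming_close(a, b, max_frac=0.25):
    """Crude stand-in for a joint-typicality test (an assumption of this sketch)."""
    return sum(s != t for s, t in zip(a, b)) <= max_frac * len(a)


def memory_encode(patterns, B_u, typical):
    """f: map each labeled pattern x(w) to (m(w), w); index 1 on failure (event E1)."""
    C_u = []
    for w, x in enumerate(patterns, start=1):
        m = next((i for i, u in enumerate(B_u, start=1) if typical(x, u)), 1)
        C_u.append((m, w))
    return C_u


def sensory_encode(y, B_v, typical):
    """phi: index of the first codeword that passes the test with y; 1 on failure (E3)."""
    return next((j for j, v in enumerate(B_v, start=1) if typical(y, v)), 1)


def classify(mu, C_u, B_u, B_v, typical):
    """g: g1 finds the unique active memory codeword matching v(mu); g2 reads its label."""
    v = B_v[mu - 1]
    active = sorted({m for m, _ in C_u})                  # indices used in training
    matches = [m for m in active if typical(B_u[m - 1], v)]
    m_hat = matches[0] if len(matches) == 1 else 1        # E4 (none) or E5 (several)
    return next((w for m, w in C_u if m == m_hat), None)  # first stored pair wins


# Toy codebooks of length-8 binary codewords, shared by memory and senses here.
B_u = [(0,) * 8, (1,) * 8, (0, 0, 0, 0, 1, 1, 1, 1)]
B_v = list(B_u)

patterns = [(1, 1, 1, 1, 1, 1, 1, 0),   # x(1)
            (0, 0, 0, 0, 1, 1, 1, 1)]   # x(2)
C_u = memory_encode(patterns, B_u, hamming_close)

y = (0, 1, 0, 0, 1, 1, 1, 1)            # noisy observation of x(2)
mu = sensory_encode(y, B_v, hamming_close)
label = classify(mu, C_u, B_u, B_v, hamming_close)
```

In this toy run the two patterns are stored under codeword indices 2 and 3, the observation is encoded as index 3, and the classifier returns label 2. Only the indices and the shared codebooks cross module boundaries, which is the point of the construction.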

Analysis of the probability of error 

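We now show that the probability of recognition error using the code $(f, \phi, g)_n$ developed above vanishes as $n \to \infty$. The following list qualitatively describes all possible sources of error using the code $(f, \phi, g)_n$:

$E_0$ The sensory data is too ambiguous, i.e. it is not strongly jointly typical with the training pattern;

$E_1$ The training pattern is unencodable;

$E_2$ Two or more training patterns are associated with same memory codeword;

$E_3$ The sensory data is unencodable;

$E_4$ The encoded sensory data matches no codeword in memory;

$E_5$ The encoded sensory data matches one or more incorrect memory codewords.

More formally, the possible errors are

$$E_0 = \{(\mathbf{x}(w), \mathbf{y}) \notin T^{\delta}_{XY}\}$$

$$E_1 = E_0^c \cap \left( \bigcap_{i=1}^{M_x} \{(\mathbf{x}, \mathbf{u}(i)) \notin T^{\delta}_{XU}\} \right)$$

$$E_2 = \left( \bigcap_{n=0}^{1} E_n^c \right) \cap \left( \bigcup_{\mathbf{x}(w') \in \mathcal{C}_x,\ w' \ne w} \{(\mathbf{x}(w'), \mathbf{u}(w)) \in T^{\delta}_{XU}\} \right)$$

$$E_3 = E_0^c \cap \left( \bigcap_{j=1}^{M_y} \{(\mathbf{y}, \mathbf{v}(j)) \notin T^{\delta}_{YV}\} \right)$$

$$E_4 = \left( \bigcap_{n=0}^{3} E_n^c \right) \cap \{(\mathbf{x}(w), \mathbf{u}(w), \mathbf{y}, \mathbf{v}(\mathbf{y})) \notin T^{\delta}_{XUYV}\}$$

$$E_5 = \left( \bigcap_{n=0}^{4} E_n^c \right) \cap \left( \bigcup_{\mathbf{u}(m') \in \mathcal{B}_u,\ m' \in \mathcal{L}^*} \{(\mathbf{u}(m'), \mathbf{v}(\mathbf{y})) \in T^{\delta}_{UV}\} \right),$$

where in the last line the set $\mathcal{L}^*$ includes all indices in $\mathcal{L}$ except $m(w)$, i.e. $\mathcal{L}^* = \mathcal{L} \setminus m(w)$. The average total probability of error is upper-bounded as

$$P_e^n \le \Pr\left( \bigcup_{i=0}^{5} E_i \right) \le \sum_{i=0}^{5} \Pr(E_i).$$

Hence to show $P_e^n < \epsilon$ it suffices to show that each term in the sum vanishes as $n \to \infty$.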
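Encoding Errors

Error event $E_0$: By the Asymptotic Equipartition Property, $\Pr(E_0) \to 0$ [?].

Error event $E_1$: For $E_1$, we use the well known fact that if $R_x > I(X;U)$, then the $M_x = 2^{nR_x}$ codewords in $\mathcal{B}_u$ are sufficient to cover the pattern source $X$. Explicitly, let $R_x = I(X;U) + \alpha$, for any $\alpha > 0$. Then for any $\epsilon > 0$ and sufficiently large $n$,

$$\begin{aligned}
\Pr(E_1) &= \sum_{(\mathbf{x},\mathbf{y}) \in T^{\delta}_{XY}} \Pr(E_1|\mathbf{x}) \Pr(\mathbf{x}) \Pr(\mathbf{y}|\mathbf{x}) \\
&\le \sum_{\mathbf{x} \in T^{\delta}_{X}} \Pr(E_1|\mathbf{x}) \Pr(\mathbf{x}) \\
&= \sum_{\mathbf{x} \in T^{\delta}_{X}} \{1 - \Pr(\mathbf{x}\mathbf{U} \in T^{\delta} | \mathbf{x})\}^{M_x} \Pr(\mathbf{x}) \\
&\overset{(a)}{\le} \{1 - 2^{-n[I(X;U)+\alpha/2]}\}^{M_x} \\
&\overset{(b)}{\le} 2^{-2^{n\alpha/2}} \\
&< \epsilon,
\end{aligned}$$

where (a) is due to the property of strongly jointly typical sets in equation (11), and in (b) we have used $(1 - a)^M \le 2^{-aM}$. Hence, $\Pr(E_1) \to 0$.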
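To get a feel for how quickly the covering bound collapses, the two sides of step (b) can be evaluated numerically. This is a small sketch under the assumption that rates are in bits and $M_x = 2^{n[I(X;U)+\alpha]}$ exactly; the function names are illustrative. The power is computed in the log domain with `math.log1p` to avoid rounding $1 - 2^{-n(\cdot)}$ to 1.

```python
import math


def covering_failure(n, mi_xu, alpha):
    """(1 - 2^{-n(I + alpha/2)})^{M_x} with M_x = 2^{n(I + alpha)}."""
    p_hit = 2.0 ** (-n * (mi_xu + alpha / 2.0))
    m_x = 2.0 ** (n * (mi_xu + alpha))
    return math.exp(m_x * math.log1p(-p_hit))


def doubly_exponential_bound(n, alpha):
    """The bound 2^{-2^{n alpha / 2}} from step (b)."""
    return 2.0 ** (-(2.0 ** (n * alpha / 2.0)))


# Example with I(X;U) = 0.5 bits and a rate margin alpha = 0.2 bits.
table = [(n, covering_failure(n, 0.5, 0.2), doubly_exponential_bound(n, 0.2))
         for n in (10, 20, 40)]
```

Each row of `table` holds the block length, the failure probability term, and its bound; the decay is doubly exponential in $n$, so even a small rate margin $\alpha$ suffices once $n$ is moderately large.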

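Error event $E_2$: Conditioned on $E_0^c \cap E_1^c$, we have $\mathbf{u}(w) \in T^{\delta}_U$. The sequences $\mathbf{X}(w') \in \mathcal{C}_x$, $w' \ne w$ are generated independently of $\mathbf{u}(w)$. Thus

$$\begin{aligned}
\Pr(E_2) &= \sum_{\mathbf{x}(w') \in \mathcal{C}_x,\ w' \ne w} \Pr(\mathbf{X}\mathbf{u}(w) \in T^{\delta}_{XU} \,|\, \mathbf{u}(w) \in T^{\delta}_U) \\
&\le 2^{nR_c}\, 2^{-nI(X;U) + n\epsilon} \\
&< \epsilon
\end{aligned}$$

for large enough $n$, since $R_c < I(U;V) \le I(X;U)$ under the Markov assumption $U - X - Y - V$. Hence, $\Pr(E_2) \to 0$.

Error event $E_3$: By a covering argument analogous to the one used in the analysis of event $E_1$, having $M_y > 2^{nI(Y;V)}$ codewords in $\mathcal{B}_v$ is sufficient to ensure $\Pr(E_3) \to 0$.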

Decoding errors 

Error event $E_4$: To analyze the probability of event $E_4$, we invoke the following uniform version of the well-known Markov Lemma [5], [17], [18], [30].

Lemma 1.1: Let $A - B - C$ be a Markov chain; let $\mathbf{ab} \in T^{\delta}_{AB}$; let $\mathbf{C}$ be chosen from a uniform distribution over $T^{\delta}_{C|\mathbf{b}}$; and let $\epsilon > 0$ be any positive constant. Then $\Pr(\mathbf{abC} \notin T^{\delta}_{ABC}) \le \epsilon$ for $n$ sufficiently large.

To bound the probability of event $E_4$, we condition on $\bigcap_{i=0}^{3} E_i^c$ and apply the Markov lemma twice in succession to establish the following two claims:

i) $\Pr(\mathbf{xyV}(\mathbf{y}) \notin T^{\delta}_{XYV} \,|\, \mathbf{xy} \in T^{\delta}, \mathbf{V}(\mathbf{y}) \in T^{\delta}_{V|\mathbf{y}}) \le \epsilon$

ii) $\Pr(\mathbf{U}(w)\mathbf{xyv}(\mathbf{y}) \notin T^{\delta}_{XYUV} \,|\, \mathbf{xyv}(\mathbf{y}) \in T^{\delta}, \mathbf{U}(w) \in T^{\delta}_{U|\mathbf{x}}) \le \epsilon$

To prove (i), note that the conditions of the Markov Lemma can be satisfied by making the following substitutions in the Lemma: $(\mathbf{a}, \mathbf{b}, \mathbf{C}) \to (\mathbf{x}, \mathbf{y}, \mathbf{V}(\mathbf{y}))$. Similarly, to prove (ii), put $(\mathbf{a}, \mathbf{b}, \mathbf{C}) \to (\mathbf{yv}, \mathbf{x}, \mathbf{U}(w))$. Combining (i) and (ii), we conclude that $\Pr(E_4) \to 0$.

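Error event $E_5$:

Given $\bigcap_{n=0}^{4} E_n^c$, we have $\mathbf{v}(\mathbf{y}) \in T^{\delta}_V$. The sequences $\mathbf{U}(m') \in \mathcal{B}_u$, $m' \in \mathcal{L}^* = \mathcal{L} \setminus m(w)$ were generated independently of $\mathbf{v}(\mathbf{y})$. Thus

$$\begin{aligned}
\Pr(E_5) &= \sum_{\mathbf{U}(m') \in \mathcal{B}_u,\ m' \in \mathcal{L}^*} \Pr(\mathbf{U}(m')\mathbf{v}(\mathbf{y}) \in T^{\delta}_{UV} \,|\, \mathbf{v}(\mathbf{y}) \in T^{\delta}_V) \\
&< \epsilon
\end{aligned}$$

for large enough $n$, since $R_c < I(U;V)$. Hence, $\Pr(E_5) \to 0$.

We have constructed a rate $R$ code for which $P_e^n \le \sum_{n=0}^{5} \Pr\{E_n\} \to 0$. Consequently, $R \in \mathcal{R}$, completing the proof.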



Appendix II 
Proof of the outer bound 

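In this section we prove Theorem 6.3, which states the outer bound $\mathcal{R} \subseteq \mathcal{R}_{out}$. In the proof let $W$ be the test index, selected from a uniform distribution $p(w)$ over the pattern indices $\mathcal{M}_c$; let $\mathbf{X} = \mathbf{X}(W)$ be the selected test pattern from the set of training patterns $\mathcal{C}_x$; let $m = m(W)$ be the compressed, memorized form of $\mathbf{X}$ computed from $f$ as $(m, W) = f(\mathbf{X}, W)$; let $\mathcal{C}_u = f(\mathcal{C}_x)$ be the memorized data; let $\mathbf{Y}$ be the sensory data; and let $\mu = \mu(W) = \phi(\mathbf{Y})$ be the encoded form of sensory data. Note that $m$ and $\mu$ are random variables through their dependence on $\mathbf{X}$ and $\mathbf{Y}$. The mutual informations in the proof are calculated with respect to the joint distribution over $(W, \mathcal{C}_x, \mathcal{C}_u, \mathbf{X}, \mathbf{Y}, m, \mu, \hat{w})$. We can verify that this distribution is well-defined by writing it out explicitly:

$$p(w, \mathcal{C}_x, \mathcal{C}_u, \mathbf{x}, \mathbf{y}, m, \mu, \hat{w}) = p(w)\, p(\mathcal{C}_x)\, p(\mathcal{C}_u|\mathcal{C}_x)\, p(\mathbf{x}|w, \mathcal{C}_x)\, p(\mathbf{y}|\mathbf{x})\, p(m|\mathbf{x}, w)\, p(\mu|\mathbf{y})\, p(\hat{w}|\mu, \mathcal{C}_u),$$

where

$$p(w) = \begin{cases} 1/M_c, & w \in \mathcal{M}_c, \\ 0, & \text{otherwise;} \end{cases}$$

$$p(\mathcal{C}_x) = \prod_{w=1}^{M_c} p(\mathbf{x}(w)),$$

$$p(\mathcal{C}_u|\mathcal{C}_x) = \begin{cases} 1, & \mathcal{C}_u = f(\mathcal{C}_x), \\ 0, & \text{otherwise;} \end{cases}$$

$$p(\mathbf{x}|w, \mathcal{C}_x) = \begin{cases} 1, & \mathbf{x} = \mathbf{x}(w),\ (\mathbf{x}, w) \in \mathcal{C}_x, \\ 0, & \text{otherwise;} \end{cases}$$

$$p(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{n} p(y_i|x_i),$$

$$p(m|\mathbf{x}, w) = \begin{cases} 1, & f(\mathbf{x}, w) = (m, w), \\ 0, & \text{otherwise;} \end{cases}$$

$$p(\mu|\mathbf{y}) = \begin{cases} 1, & \mu = \phi(\mathbf{y}), \\ 0, & \text{otherwise;} \end{cases}$$

$$p(\hat{w}|\mu, \mathcal{C}_u) = \begin{cases} 1, & \hat{w} = g(\mu, \mathcal{C}_u), \\ 0, & \text{otherwise.} \end{cases}$$

The independence relationships underlying the structure of this distribution are clear from the block diagram of the recognition system. They are also usefully displayed using a directed graphical model ('Bayes' net') [9], [16].

Fig. 5. Independence relationships for $(W, \mathcal{C}_x, \mathcal{C}_u, \mathbf{X}, \mathbf{Y}, m, \mu, \hat{w})$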



Proof: (Theorem 6.3)

Assume $R = (R_x, R_y, R_c) \in \mathcal{R}$. Then there exists a sequence of $(M_x, M_y, M_c, n)$ codes $(f, \phi, g)_n$, such that for any $\epsilon > 0$,

$$M_c \ge 2^{nR_c}$$
$$M_x \le 2^{nR_x}$$
$$M_y \le 2^{nR_y}$$

and $P_e^n = \Pr(\hat{W} \ne W) < \epsilon$. To show that $R \in \mathcal{R}_{out}$, we must construct a pair of auxiliary random variables $UV$ such that $UV \in \mathcal{P}_{out}$ and $R \in \mathcal{R}_{UV}$.

We construct the desired pair $UV$ in three steps: (1) We introduce a set of intermediate random variable pairs $U_i V_i$, $i = 1, 2, \ldots, n$, individually contained in $\mathcal{P}_{out}$; (2) we derive mutual information inequalities for $R_x$, $R_y$, and $R_c$ involving sums of the intermediate variables; (3) we convert the sum inequalities into inequalities in the final pair $UV$ by applying Lemma 2.1.

Step 1:

Let the intermediate auxiliary random variables be

$$U_i = (m, W, X^{i-1})$$
$$V_i = (\mu, Y^{i-1}),$$

for $i = 1, 2, \ldots, n$. Each pair is in $\mathcal{P}_{out}$. This is verified for the $U_i$ by calculating

$$\begin{aligned}
I(U_i;Y_i|X_i) &= H(Y_i|X_i) - H(Y_i|m, W, X^{i-1}, X_i) \\
&= H(Y_i|X_i) - H(Y_i|m, W, X^i) \\
&\overset{(a)}{\le} H(Y_i|X_i) - H(Y_i|m, W, X^n) \\
&\overset{(b)}{=} H(Y_i|X_i) - H(Y_i|X^n) \\
&\overset{(c)}{=} H(Y_i|X_i) - H(Y_i|X_i) \\
&= 0,
\end{aligned}$$

where the reasons for the lettered steps are (a) conditioning reduces entropy, (b) the $Y_i$ are independent of all other variables given $X^n$, and (c) the pairs $X_i Y_i$ are i.i.d. Hence, $U_i - X_i - Y_i$ is a Markov chain. By a similar argument, $X_i - Y_i - V_i$ is also a Markov chain. Hence, $U_i V_i \in \mathcal{P}_{out}$ for each $i = 1, 2, \ldots, n$.

Step 2:

First,

$$\begin{aligned}
M_c (nR_x) &\ge M_c \log M_x \\
&\ge H(\mathcal{C}_u) \\
&\overset{(a)}{=} H(\mathcal{C}_u) - H(\mathcal{C}_u|\mathcal{C}_x) \\
&= I(\mathcal{C}_u;\mathcal{C}_x) \\
&= H(\mathcal{C}_x) - H(\mathcal{C}_x|\mathcal{C}_u) \\
&\overset{(b)}{=} \sum_{w=1}^{M_c} [H(X^n(w), w) - H(X^n(w), w|m(w), w)] \\
&\overset{(c)}{=} \sum_{w=1}^{M_c} [H(X^n(W)|W = w) - H(X^n(W)|m(w), W = w)] \\
&\overset{(d)}{=} \sum_{w=1}^{M_c} [H(X^n(w)) - H(X^n(W)|m(w), W = w)] \\
&\overset{(e)}{=} \sum_{w=1}^{M_c} [H(X^n) - H(X^n|m, W = w)] \\
&= \sum_{w=1}^{M_c} \sum_{i=1}^{n} [H(X_i|X^{i-1}) - H(X_i|m, W = w, X^{i-1})] \\
&\overset{(f)}{=} \sum_{i=1}^{n} \sum_{w=1}^{M_c} [H(X_i) - H(X_i|m, W = w, X^{i-1})] \\
&\overset{(g)}{=} M_c \sum_{i=1}^{n} \Big[ H(X_i) - \sum_{w=1}^{M_c} p(w) H(X_i|m, W = w, X^{i-1}) \Big] \\
&= M_c \sum_{i=1}^{n} [H(X_i) - H(X_i|m, W, X^{i-1})] \\
&= M_c \sum_{i=1}^{n} [H(X_i) - H(X_i|U_i)] \\
&= M_c \sum_{i=1}^{n} I(X_i;U_i),
\end{aligned}$$

or

$$nR_x \ge \sum_{i=1}^{n} I(X_i;U_i),$$

where the justifications are (a) $\mathcal{C}_u = f(\mathcal{C}_x)$; (b) the pairs $(X^n(w), w)$ are independent; (c) in this expression $w$ is a deterministic variable (i.e. $H(w) = H(W|W = w) = 0$); (d) the $X^n(w)$ are i.i.d. and independent of their index $w$; (e) to simplify notation, we have written $m = m(w)$, $X^n = X^n(w)$; (f) the $X_i$ are i.i.d.; and (g) $W$ is distributed according to $p(w) = 1/M_c$, $w = 1, 2, \ldots, M_c$.

Next,

$$\begin{aligned}
nR_y &\ge H(\mu) \\
&\overset{(a)}{=} I(\mu;Y^n) \\
&= \sum_{i=1}^{n} I(Y_i;V_i).
\end{aligned}$$

Step (a) follows from $\mu = \phi(Y^n)$.
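Finally,

$$\begin{aligned}
nR_c &\le \log M_c \\
&= H(W) \\
&= I(W;\mathcal{C}_u, \mu) + H(W|\mathcal{C}_u, \mu) \\
&\overset{(a)}{\le} I(W;\mathcal{C}_u, \mu) + n\epsilon_n \\
&= I(W;\mathcal{C}_u) + I(W;\mu|\mathcal{C}_u) + n\epsilon_n \\
&\overset{(b)}{=} 0 + I(W;\mu|\mathcal{C}_u) + n\epsilon_n \\
&= I(W, \mathcal{C}_u;\mu) - I(\mu;\mathcal{C}_u) + n\epsilon_n \\
&\le I(W, \mathcal{C}_u;\mu) + n\epsilon_n \\
&\overset{(c)}{=} I(W, m;\mu) + n\epsilon_n \\
&\overset{(d)}{=} \sum_{i=1}^{n} [I(X_i;U_i) + I(Y_i;V_i) - I(X_i Y_i;U_i V_i)] + n\epsilon_n \\
&\overset{(e)}{=} \sum_{i=1}^{n} [I(U_i;V_i) - I(U_i;V_i|X_i Y_i)] + n\epsilon_n.
\end{aligned}$$

The lettered steps are justified as follows.

(a) By assumption, $\Pr(\hat{W} \ne W) = P_e^n \to 0$, where $\hat{W} = g_n(\mu, \mathcal{C}_u)$. Thus, applying Fano's inequality yields

$$H(W|\mathcal{C}_u, \mu) \le H(P_e^n) + P_e^n \log(M_c - 1) \le n\epsilon_n,$$

where $\epsilon_n \to 0$.

(b) The test index $W$ and patterns $\mathcal{C}_x$ are drawn independently, hence $W$ and $\mathcal{C}_u = f(\mathcal{C}_x)$ are independent and $I(W;\mathcal{C}_u) = 0$.

(c) Writing $\mathcal{C}_u = \mathcal{C}_u^* \cup \{(m, W)\}$, $\mathcal{C}_u^* = \mathcal{C}_u \setminus \{(m, W)\}$, we have

$$\begin{aligned}
I(W, \mathcal{C}_u;\mu) &= I(W, (m, W), \mathcal{C}_u^*;\mu) \\
&= I(W, m;\mu) + I(W, \mathcal{C}_u^*;\mu|W, m) \\
&= I(W, m;\mu) + I(\mathcal{C}_u^*;\mu|W, m) \\
&= I(W, m;\mu) + 0,
\end{aligned}$$

since the $(m(i), i)$ are independent of $\mu$ for $i \ne W$.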

(d) To justify this step we invoke the following two results, proved in Appendix VI. Let $A, \alpha, B, \beta$, and $\gamma$ be arbitrary discrete random variables. Then:

Theorem 2.1:

$$I(\alpha;\beta) \ge I(A;\alpha) + I(B;\beta) - I(AB;\alpha\beta),$$

with equality if and only if $I(A\alpha;B\beta) = I(A;B)$.

Theorem 2.2: Let $Z_i = (\gamma, A^{i-1})$, $i = 1, 2, \ldots, n$, where the $A_i$ are i.i.d. Then

$$\sum_{i=1}^{n} I(A_i;Z_i) = I(A^n;\gamma).$$
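Both results follow from the chain rule for mutual information. The sketch below is one possible route and is not necessarily the argument of the appendix cited above; it uses only the chain rule and, for the second result, the fact that the $A_i$ are i.i.d.

```latex
% Theorem 2.1. Expand I(AB; alpha beta) and I(AB alpha; beta) in two ways:
\begin{align*}
I(AB;\alpha\beta) &= I(A;\alpha) + I(B;\alpha \mid A) + I(AB;\beta \mid \alpha), \\
I(\alpha;\beta) + I(AB;\beta \mid \alpha) &= I(AB\alpha;\beta)
   = I(B;\beta) + I(A\alpha;\beta \mid B).
\end{align*}
% Combining the two lines gives
\begin{align*}
I(\alpha;\beta) - \big[ I(A;\alpha) + I(B;\beta) - I(AB;\alpha\beta) \big]
   &= I(B;\alpha \mid A) + I(A\alpha;\beta \mid B) \;\ge\; 0.
\end{align*}
% The same two terms appear in
\begin{align*}
I(A\alpha;B\beta) - I(A;B) &= I(\alpha;B \mid A) + I(A\alpha;\beta \mid B),
\end{align*}
% so equality holds exactly when I(A alpha; B beta) = I(A;B).

% Theorem 2.2. With Z_i = (gamma, A^{i-1}) and the A_i i.i.d.:
\begin{align*}
\sum_{i=1}^{n} I(A_i;Z_i)
   &= \sum_{i=1}^{n} \big[ I(A_i;A^{i-1}) + I(A_i;\gamma \mid A^{i-1}) \big] \\
   &= \sum_{i=1}^{n} I(A_i;\gamma \mid A^{i-1}) \\
   &= I(A^n;\gamma).
\end{align*}
```

The slack in Theorem 2.1 is thus $I(B;\alpha|A) + I(A\alpha;\beta|B)$, which vanishes when $\alpha$ depends on $(A, B)$ only through $A$ and $\beta$ depends on $(A, \alpha, B)$ only through $B$.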

To apply Theorem 2.1, make the substitution $(\alpha, \beta, A, B) \to (mW, \mu, X^n, Y^n)$. Then the condition for equality is satisfied:

$$\begin{aligned}
I(X^n, m, W;Y^n, \mu) &= I(X^n, W;Y^n, \mu) + I(m, W;Y^n, \mu|X^n, W) \\
&\overset{(a)}{=} I(X^n, W;Y^n, \mu) + 0 \\
&= I(X^n, W;Y^n) + I(X^n, W;\mu|Y^n) \\
&\overset{(b)}{=} I(X^n, W;Y^n) + 0 \\
&= I(X^n;Y^n) + I(W;Y^n|X^n) \\
&\overset{(c)}{=} I(X^n;Y^n) + 0,
\end{aligned}$$

since (a) $(m, W) = f(X^n, W)$, (b) $\mu = \phi(Y^n)$, and (c) $Y^n$ only depends on $W$ through $X^n = X^n(W)$, so that $H(Y^n|X^n, W) = H(Y^n|X^n)$. Thus Theorem 2.1 yields

$$I(m, W;\mu) = I(X^n;m, W) + I(Y^n;\mu) - I(X^n, Y^n;m, W, \mu). \qquad (12)$$

Next, apply Theorem 2.2 three times with the substitutions:

$$\begin{aligned}
(Z_i, \gamma, A^{i-1}) &\to (U_i, mW, X^{i-1}) \\
&\to (V_i, \mu, Y^{i-1}) \\
&\to (U_i V_i, mW\mu, X^{i-1}Y^{i-1}),
\end{aligned}$$

to obtain

$$\sum_{i=1}^{n} I(X_i;U_i) = I(X^n;m, W)$$

$$\sum_{i=1}^{n} I(Y_i;V_i) = I(Y^n;\mu)$$

$$\sum_{i=1}^{n} I(X_i Y_i;U_i V_i) = I(X^n Y^n;m, W, \mu).$$

Adding the first two expressions and subtracting the third yields

$$\sum_{i=1}^{n} [I(X_i;U_i) + I(Y_i;V_i) - I(X_i, Y_i;U_i, V_i)] = [I(X^n;m, W) + I(Y^n;\mu) - I(X^n, Y^n;m, W, \mu)]. \qquad (13)$$

Combining (12) and (13) yields

$$I(m, W;\mu) = \sum_{i=1}^{n} I(X_i;U_i) + I(Y_i;V_i) - I(X_i, Y_i;U_i, V_i),$$

as claimed.

(e) This step is justified by the following chain of equalities:

$$\begin{aligned}
I(X_i;U_i) &+ I(Y_i;V_i) - I(X_i, Y_i;U_i, V_i) \\
&= H(U_i) - H(U_i|X_i) + H(V_i) - H(V_i|Y_i) - [H(U_i V_i) - H(U_i V_i|X_i Y_i)] \\
&= [H(U_i) + H(V_i) - H(U_i V_i)] - [H(U_i|X_i) + H(V_i|Y_i) - H(U_i V_i|X_i Y_i)] \\
&= I(U_i;V_i) - [H(U_i|X_i Y_i) + H(V_i|X_i Y_i) - H(U_i V_i|X_i Y_i)] \\
&= I(U_i;V_i) - I(U_i;V_i|X_i Y_i),
\end{aligned}$$

for each $i = 1, 2, \ldots, n$, where in the second-to-last step we have used the fact that $U_i - X_i - Y_i$ and $X_i - Y_i - V_i$ are Markov chains for $i = 1, 2, \ldots, n$, as shown above in Step 1.

Step 3:

For this step we use the following Lemma, proved in Appendix III.

Lemma 2.1: Suppose $U_i V_i \in \mathcal{P}_{out}$, $i = 1, 2, \ldots, n$. Then there exists $UV \in \mathcal{P}_{out}$ such that

$$\frac{1}{n}\sum_{i=1}^{n} I(X_i;U_i) = I(X;U)$$

$$\frac{1}{n}\sum_{i=1}^{n} I(Y_i;V_i) = I(Y;V)$$

$$\frac{1}{n}\sum_{i=1}^{n} [I(U_i;V_i) - I(U_i;V_i|X_i Y_i)] = I(U;V) - I(U;V|XY).$$

Applying Lemma 2.1 to the results of steps 1 and 2, we obtain

$$R_x \ge \frac{1}{n}\sum_{i=1}^{n} I(X_i;U_i) = I(X;U)$$

$$R_y \ge \frac{1}{n}\sum_{i=1}^{n} I(Y_i;V_i) = I(Y;V)$$

$$R_c \le \frac{1}{n}\sum_{i=1}^{n} [I(U_i;V_i) - I(U_i;V_i|X_i Y_i)] = I(U;V) - I(U;V|XY),$$

where $UV \in \mathcal{P}_{out}$. With respect to this $UV$, by definition we have $R \in \mathcal{R}_{UV}$. Hence, $R \in \mathcal{R}_{out}$, and the proof is complete.



Appendix III 
Convexity of the outer bound 



In this Appendix we prove a slightly more general version of Lemma 2.1 from Appendix II, and demonstrate that the outer bound rate region $\mathcal{R}_{out}$ is convex.

In the following, let $\mathcal{Q}$ be any finite alphabet, and assume that we have pairs $X_q Y_q$ for all $q \in \mathcal{Q}$ which are i.i.d. $\sim p(xy)$.

Lemma 3.1: Suppose $U_q V_q \in \mathcal{P}_{out}$ for all $q \in \mathcal{Q}$, and let $Q \sim p(q)$, $q \in \mathcal{Q}$ be any discrete random variable independent of the pairs $\{X_q Y_q\}$. Then there exists a pair of discrete random variables $UV \in \mathcal{P}_{out}$ such that

$$\sum_{q \in \mathcal{Q}} p(q) I(X_q;U_q) = I(X;U)$$

$$\sum_{q \in \mathcal{Q}} p(q) I(Y_q;V_q) = I(Y;V)$$

$$\sum_{q \in \mathcal{Q}} p(q) [I(U_q;V_q) - I(U_q;V_q|X_q Y_q)] = I(U;V) - I(U;V|XY).$$

Remark 3.1: Lemma 2.1 in Appendix II follows immediately from the above Lemma, by choosing $\mathcal{Q} = \{1, 2, \ldots, n\}$ and $p(q) = 1/n$ for all $q \in \mathcal{Q}$.

Proof: As a candidate for the pair $UV$ in the Lemma, consider $U = (U_Q, Q)$ and $V = (V_Q, Q)$, i.e.

$$U = \{U_q \text{ if } Q = q\}$$
$$V = \{V_q \text{ if } Q = q\}.$$

To verify that $UV \in \mathcal{P}_{out}$, we proceed to check that $U - X - Y$ and $X - Y - V$ are Markov chains. By the assumption $U_q V_q \in \mathcal{P}_{out}$ for each $q \in \mathcal{Q}$, we have $I(U_q;Y_q|X_q) = 0$ and $I(V_q;X_q|Y_q) = 0$. Hence

$$\begin{aligned}
0 &= \sum_{q \in \mathcal{Q}} p(q) I(U_q;Y_q|X_q) \\
&= \sum_{q \in \mathcal{Q}} p(q) I(U_q;Y_q|X_q, Q = q) \\
&= I(U_Q;Y_Q|X_Q, Q) \\
&\overset{(a)}{=} I(U_Q;Y|X, Q) \\
&= I(U_Q Q;Y|X) - I(Q;Y|X) \\
&\overset{(b)}{=} I(U_Q Q;Y|X) \\
&= I(U;Y|X),
\end{aligned}$$

where in (a) we are able to drop the subscript $Q$ on $X_Q$ and $Y_Q$ because the $X_q$ and $Y_q$ are i.i.d. and independent of $Q$; and similarly (b) is because $I(Q;Y|X) = 0$, due to the independence of $Q$ and $Y$. By an analogous calculation, we also find $I(V;X|Y) = 0$. Hence, $U - X - Y$ and $X - Y - V$, and $UV \in \mathcal{P}_{out}$ as desired.

It remains to demonstrate the three equalities in the Lemma. For the first equality, we write

$$\begin{aligned}
I(X;U) &= I(X;U_Q Q) \\
&= I(X;U_Q|Q) + I(X;Q) \\
&\overset{(a)}{=} I(X;U_Q|Q) \\
&\overset{(b)}{=} I(X_Q;U_Q|Q) \\
&= \sum_{q \in \mathcal{Q}} p(q) I(X_q;U_q),
\end{aligned}$$

where, as above, (a) and (b) follow from the facts that the $X_q$ are i.i.d. and independent of $Q$. Similar calculations yield

$$I(Y;V) = \sum_{q \in \mathcal{Q}} p(q) I(Y_q;V_q),$$

which is the second required equality, and

$$I(XY;UV) = \sum_{q \in \mathcal{Q}} p(q) I(X_q Y_q;U_q V_q).$$

This last equality can be combined with the first two to yield the third required equality using

$$I(X;U) + I(Y;V) - I(XY;UV) = I(U;V) - I(U;V|XY),$$

which follows from the two short Markov chains $U - X - Y$ and $X - Y - V$, as shown in subsection VII-H, equation (6). The proof is complete. ■
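The first equality of the Lemma can be checked numerically on a small example. The sketch below assumes a binary source $X \sim B(1/2)$, two test channels $U_q = X \oplus W_q$ with $W_q \sim B(a_q)$, and a time-sharing variable $Q$ independent of $X$; these choices, the base-2 logarithm, and all names in the code are illustrative and not from the paper.

```python
import math


def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def joint_q_x_u(p_q, noise):
    """pmf of (Q, X, U_Q) with X ~ B(1/2) independent of Q and U_q = X ^ W_q."""
    pmf = {}
    for qi, pq in enumerate(p_q):
        for x in (0, 1):
            for w in (0, 1):
                p = pq * 0.5 * (noise[qi] if w else 1.0 - noise[qi])
                key = (qi, x, x ^ w)
                pmf[key] = pmf.get(key, 0.0) + p
    return pmf


def entropy(pmf, idx):
    """Entropy in bits of the marginal on the coordinates listed in idx."""
    marg = {}
    for key, p in pmf.items():
        sub = tuple(key[i] for i in idx)
        marg[sub] = marg.get(sub, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0.0)


def mi(pmf, a, b):
    """Mutual information I(A;B) in bits."""
    return entropy(pmf, a) + entropy(pmf, b) - entropy(pmf, a + b)


p_q = [0.3, 0.7]        # distribution of the time-sharing variable Q
noise = [0.1, 0.25]     # crossover probabilities a_q of the two test channels
pmf = joint_q_x_u(p_q, noise)

# U = (U_Q, Q) occupies coordinates (2, 0); X is coordinate 1.
i_x_u = mi(pmf, (1,), (2, 0))
mixture = sum(pq * (1.0 - h2(a)) for pq, a in zip(p_q, noise))
```

Here `i_x_u` and `mixture` agree to numerical precision, illustrating that appending $Q$ to the auxiliary variable turns a convex combination of rates into a single-letter mutual information.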

The convexity of $\mathcal{R}_{out}$ follows readily from the preceding Lemma.

Lemma 3.2: $\mathcal{R}_{out}$ is convex. That is, let $R_q$ be any set of rates such that $R_q \in \mathcal{R}_{out}$ for all $q \in Q$, where $Q$
is a finite alphabet, and let $p(q)$ be any probability distribution over $Q$. Then $R = \sum_{q \in Q} p(q) R_q \in \mathcal{R}_{out}$.

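Proof: Fix an arbitrary distribution $p(q)$ and rates $R_q \in \mathcal{R}_{out}$ for all $q \in Q$. By the definition of
$\mathcal{R}_{out}$, for each rate $R_q$, there exists a pair $U_q V_q \in \mathcal{P}_{out}$ such that $R_q \in \mathcal{R}_{U_q V_q}$. Consequently,

$$\begin{aligned}
R_x &= \sum_{q \in Q} p(q) R_{x,q} \ge \sum_{q \in Q} p(q)\, I(X_q; U_q) \\
R_y &= \sum_{q \in Q} p(q) R_{y,q} \ge \sum_{q \in Q} p(q)\, I(Y_q; V_q) \\
R_c &= \sum_{q \in Q} p(q) R_{c,q} \le \sum_{q \in Q} p(q)\,[I(U_q; V_q) - I(U_q; V_q | X_q Y_q)].
\end{aligned}$$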

As in the proof of Lemma 3.1, use these pairs to construct a new pair $UV$, by defining $U = (U_Q, Q)$,
$V = (V_Q, Q)$. From the proof of Lemma 3.1 we know (1) that $UV \in \mathcal{P}_{out}$ and (2) that the sums on the right
hand sides of the inequalities above can be replaced with expressions in $U$ and $V$, yielding

$$\begin{aligned}
R_x &\ge I(X; U) \\
R_y &\ge I(Y; V) \\
R_c &\le I(U; V) - I(U; V | XY),
\end{aligned}$$

which means that $R \in \mathcal{R}_{UV}$ for the given $UV$. Hence, $R = \sum_{q \in Q} p(q) R_q \in \mathcal{R}_{out}$. Since $p(q)$ and
$R_q \in \mathcal{R}_{out}$ were arbitrary, we conclude that $\mathcal{R}_{out}$ is convex. ■



Appendix IV
Proof of theorem 7.2

In this section we prove theorem 7.2. The argument is based on time sharing. Consider a sequence of
codes of lengths $n_i$ that achieve $(R_c, R_x, R_y)$. Corresponding to this sequence is a sequence of codes
of lengths $m_i$ that satisfy $\theta m_i = n_i$, constructed as follows. For each $m_i$, select any $\theta m_i$ components;
reveal the indices of the selected components to the memory encoder and the sensory encoder. Use the
corresponding code from the first sequence on these components, ignoring all other components. For $m_i$,
there are $2^{m_i \theta R_c}$ patterns, $2^{m_i \theta R_x}$ memory states, and $2^{m_i \theta R_y}$ sensory states.
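The bookkeeping in this time-sharing argument can be illustrated with a few lines of arithmetic. This is a sketch only: the function name and the numerical values are hypothetical, and counts are reported as base-2 logarithms to avoid very large numbers.

```python
def time_shared_log2_counts(n, theta, rates):
    """Return (m, log2 counts) when a length-n code is run on a fraction
    theta of the components of a longer block of length m = n / theta.

    `rates` is (R_c, R_x, R_y) for the original length-n code.
    """
    m = n / theta
    # 2^(m * theta * R) patterns, memory states and sensory states
    return m, tuple(m * theta * r for r in rates)


# Hypothetical numbers: a length-100 code used on half of 200 components.
m, (log_patterns, log_memory, log_sensory) = time_shared_log2_counts(
    n=100, theta=0.5, rates=(0.1, 0.4, 0.3))

# Rates measured per component of the longer block are scaled by theta.
effective_rates = (log_patterns / m, log_memory / m, log_sensory / m)
print(m, effective_rates)
```

Measured against the longer block length, each of the three rates is scaled by the same factor $\theta$.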

The corollary 7.2 follows immediately from the inner bound, theorem 6.2.



Appendix V
Proof of Lemma 7.4

In this Appendix we prove the 'if-then' statement asserted in Lemma 7.4.

The assumptions of the statement are that (a) $\mathcal{P}_{mix} = \mathcal{P}_{out}$ and (b) that the achievable rate region $\mathcal{R}$ is
convex. We wish to show that these imply $\mathcal{R} = \mathcal{R}_{out}$.

From theorem 6.3 we have $\mathcal{R}_{out} \supseteq \mathcal{R}$. To prove the Lemma, we must demonstrate the converse, $\mathcal{R}_{out} \subseteq \mathcal{R}$.

It suffices to show the boundary points of $\mathcal{R}_{out}$ are achievable. Let $R = (R_c, R_x, R_y)$ be an arbitrary rate
on the boundary of $\mathcal{R}_{out}$. Then there exists $UV \in \mathcal{P}_{out}$ such that $R_c = I(X; U) + I(Y; V) - I(XY; UV)$,
$R_x = I(X; U)$ and $R_y = I(Y; V)$. In turn, assumption (a) $\mathcal{P}_{mix} = \mathcal{P}_{out}$ implies that there exists $Q \sim p(q)$,
$q \in Q$, independent of $XY$ and pairs $U_q V_q \in \mathcal{P}_{in}$, $q \in Q$, such that $R_x = I(X; U_Q, Q)$, $R_y = I(Y; V_Q, Q)$,
and $R_c = I(X; U_Q, Q) + I(Y; V_Q, Q) - I(XY; U_Q V_Q, Q)$. Hence, using the independence
of $Q$ from $X$ and $Y$ we have

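$$\begin{aligned}
R_x &= \sum_{q \in Q} I(U_q; X)\, p(q) \\
R_y &= \sum_{q \in Q} I(V_q; Y)\, p(q) \\
R_c &= \sum_{q \in Q} [I(X; U_q) + I(Y; V_q) - I(XY; U_q V_q)]\, p(q).
\end{aligned}$$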






Next, let $R_{xq} = I(U_q; X)$, $R_{yq} = I(V_q; Y)$, $R_{cq} = I(X; U_q) + I(Y; V_q) - I(XY; U_q V_q)$, for $q =
1, 2, \ldots, |Q|$. Then, by definition, each rate $R_q = (R_{cq}, R_{xq}, R_{yq})$ is in $\mathcal{R}_{in}$. Since $\mathcal{R}_{in} \subseteq \mathcal{R}$ by theorem
6.2, $R_q \in \mathcal{R}$ for each $q \in Q$.

According to the preceding argument, $R = (R_c, R_x, R_y)$ is a convex combination of achievable rates.
Consequently, if $\mathcal{R}$ is convex as assumed, then $R \in \mathcal{R}$. Since the rate $R$ was an arbitrary boundary point
of $\mathcal{R}_{out}$, we conclude $\mathcal{R}_{out} \subseteq \mathcal{R}$, hence $\mathcal{R} = \mathcal{R}_{out}$ as desired.



Appendix VI
Proofs of theorems 2.1 and 2.2



Consider the elementary Shannon inequalities, stated in the following two Lemmas. The variables $A, B, \alpha, \beta, \gamma, \delta$
appearing in the Lemmas denote arbitrary discrete random variables.

Lemma 6.1:

$$I(A; \alpha) = I(A; \alpha, \gamma) - I(A, \alpha; \gamma) + I(\alpha; \gamma).$$

Proof:

$$\begin{aligned}
I(A; \gamma | \alpha) &= I(A; \alpha, \gamma) - I(A; \alpha) \\
&= I(A, \alpha; \gamma) - I(\gamma; \alpha).
\end{aligned}$$

■
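The two lines of this proof are two applications of the chain rule for mutual information. Spelled out, as a sketch that uses only the standard chain rule, the steps are:

```latex
\begin{align*}
I(A; \alpha, \gamma) &= I(A; \alpha) + I(A; \gamma \mid \alpha), \\
I(A, \alpha; \gamma) &= I(\alpha; \gamma) + I(A; \gamma \mid \alpha).
\end{align*}
% Eliminating the common term I(A; \gamma \mid \alpha) gives
\begin{align*}
I(A; \alpha, \gamma) - I(A; \alpha) &= I(A, \alpha; \gamma) - I(\alpha; \gamma), \\
I(A; \alpha) &= I(A; \alpha, \gamma) - I(A, \alpha; \gamma) + I(\alpha; \gamma).
\end{align*}
```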

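Lemma 6.2:

$$I(A; \alpha) + I(B; \beta) = I(A; B) + I(\alpha; \beta) - I(A, \alpha; B, \beta) + I(A, B; \alpha, \beta).$$

Proof:

$$\begin{aligned}
I(A, \alpha; B, \beta) - I(A, B; \alpha, \beta) &= H(A, \alpha) + H(B, \beta) - H(A, B) - H(\alpha, \beta) \\
&= -I(A; \alpha) - I(B; \beta) + I(A; B) + I(\alpha; \beta).
\end{aligned}$$

■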



Theorems 2.1 and 2.2 follow directly from the Lemmas above.

Theorem 6.1 (Theorem 2.1):

$$I(\alpha; \beta) \ge I(A; \alpha) + I(B; \beta) - I(A, B; \alpha, \beta),$$

with equality if and only if $I(A, \alpha; B, \beta) = I(A; B)$.

Proof: Rearrange Lemma 6.2 to get

$$I(\alpha; \beta) = I(A; \alpha) + I(B; \beta) - I(A, B; \alpha, \beta) + [I(A, \alpha; B, \beta) - I(A; B)].$$

The theorem now follows readily from the preceding expression: we obtain equality in the theorem if (and
only if) the term in brackets is zero. The bracketed term is always nonnegative, since

$$\begin{aligned}
I(A, \alpha; B, \beta) - I(A; B) &= H(\alpha | A) + H(\beta | B) - H(\alpha, \beta | A, B) \\
&= H(\alpha | A) - H(\alpha | A, B) + H(\beta | B) - H(\beta | A, B, \alpha) \\
&\ge 0,
\end{aligned}$$

where the inequality is due to the fact that conditioning reduces entropy. ■
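As a sanity check, the decomposition and the inequality can be tested on randomly generated joint distributions. The sketch below is illustrative and not part of the original proof: the helper `mutual_information`, the binary alphabets, and the fixed random seed are all assumptions.

```python
import itertools
import math
import random


def mutual_information(joint, left, right, given=()):
    """Return I(left; right | given) in bits.

    `joint` maps outcome tuples to probabilities; `left`, `right` and
    `given` are tuples of coordinate positions within each outcome.
    """
    def entropy(idx):
        marginal = {}
        for outcome, p in joint.items():
            key = tuple(outcome[i] for i in idx)
            marginal[key] = marginal.get(key, 0.0) + p
        return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

    # I(L; R | G) = H(L, G) + H(R, G) - H(L, R, G) - H(G)
    return (entropy(left + given) + entropy(right + given)
            - entropy(left + right + given) - entropy(given))


rng = random.Random(0)  # fixed seed so the run is repeatable
outcomes = list(itertools.product((0, 1), repeat=4))  # (A, B, alpha, beta)
A, B, ALPHA, BETA = (0,), (1,), (2,), (3,)

residuals, gaps = [], []
for _ in range(20):
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    joint = {o: w / total for o, w in zip(outcomes, weights)}

    bound = (mutual_information(joint, A, ALPHA)
             + mutual_information(joint, B, BETA)
             - mutual_information(joint, A + B, ALPHA + BETA))
    gap = (mutual_information(joint, A + ALPHA, B + BETA)
           - mutual_information(joint, A, B))
    # Lemma 6.2 rearranged: I(alpha; beta) = bound + gap
    residuals.append(mutual_information(joint, ALPHA, BETA) - bound - gap)
    gaps.append(gap)

print(max(abs(r) for r in residuals), min(gaps))
```

The residuals should vanish up to rounding, and the bracketed term (`gap`) should never be negative.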
Theorem 6.2 (Theorem 2.2): If $U_i = (\gamma, A^{i-1})$, then

$$I(A^n; \gamma) = \sum_{i=1}^{n} I(A_i; U_i) - \sum_{i=2}^{n} I(A_i; A^{i-1}).$$

Proof: In Lemma 6.1 put $A = A_i$, $\alpha = A^{i-1}$. Note that $U_1 = \gamma$. Hence, substituting and summing
from 2 to $n$ yields

$$\begin{aligned}
\sum_{i=2}^{n} I(A_i; A^{i-1}) &= \sum_{i=2}^{n} I(A_i; U_i) - I(A^n; \gamma) + I(A_1; \gamma) \\
&= \sum_{i=2}^{n} I(A_i; U_i) - I(A^n; \gamma) + I(A_1; U_1) \\
&= \sum_{i=1}^{n} I(A_i; U_i) - I(A^n; \gamma).
\end{aligned}$$

■



Appendix VII 
Simplification of convex hulls 

In this section we argue geometrically that the expressions for convex hulls of the inner bound regions
simplify to just one term in both the binary and Gaussian cases. To discuss both cases simultaneously, let
us represent the surface of either inner bound by a positive valued function $f : \mathcal{D} \to \mathbb{R}^+$. Here, $\mathcal{D}$ is a
square region

$$\mathcal{D} = \{r = (x, y) \in \mathbb{R}^2 : 0 \le x \le M,\ 0 \le y \le M\},$$

and $M$ is a positive constant. In the binary case, $f(r) = g(r)$, and $\mathcal{D} = [0, 1] \times [0, 1]$; in the Gaussian
case, $f(r) = G(r)$, and $\mathcal{D} = [0, \infty) \times [0, \infty)$. Some important properties shared by both cases are that for
all $r = (x, y) \in \mathcal{D}$,

$$f(x, y) \ge 0, \qquad f(0, y) = f(x, 0) = 0,$$
$$f_x(r), f_y(r) > 0, \qquad f_{xx}(r), f_{yy}(r) < 0,$$

where the subscripts denote partial derivatives.

Denote the convex hull of $f(r)$ by $c(r)$. Generically, the boundary of the convex hull is

$$c(r) = \max\ \theta f(r_1) + \bar{\theta} f(r_2),$$

where $\bar{\theta} = 1 - \theta$ and the maximum is over all triples $(\theta, r_1, r_2)$ such that $r = \theta r_1 + \bar{\theta} r_2$, $\theta \in [0, 1]$, and $r_1, r_2 \in \mathcal{D}$.
However, as argued next, for the cases under study this simplifies to

$$c(r) = \max\ \theta f(r'),$$

where $r = \theta r'$.

The convex hull of a surface can be characterized in terms of its tangent planes. Given any point $r' =
(x', y') \in \mathcal{D}$, if its tangent plane lies entirely above the surface, then $(r', f(r'))$ is on the convex hull. If
the tangent plane cuts through the surface at one or more other points, then $(r', f(r'))$ is not on the convex
hull. If the tangent plane touches the surface at exactly two points, then both points are on the convex
hull.

The tangent plane at an arbitrary point $r' = (x', y') \in \mathcal{D}$ is the set of points satisfying

$$z(x, y) = f_x (x - x') + f_y (y - y') + z',$$

where the partial derivatives are evaluated at $r'$, i.e. $f_x = f_x(r')$, $f_y = f_y(r')$, and $z' = f(r')$. The tangent
plane intersects the $z = 0$ plane in a line. Setting $z(r) = 0$ and solving for $y$ gives $y = mx + b$, where

$$m = -(f_x / f_y), \qquad b = \frac{1}{f_y}\,(x' f_x + y' f_y - z').$$

Since $f_x, f_y > 0$, the slope $m = -(f_x / f_y)$ is negative. This line intersects the positive orthant whenever
the intercept $b > 0$, in which case the tangent plane cuts through the surface, since $f \ge 0$. Thus, the only
points on the original surface $f(x, y)$ that can be on the convex hull are those for which $b \le 0$.

Next consider any path through $\mathcal{D}$ along a line segment $y = ax$, $a > 0$, starting from one of the 'outer
edges' of $\mathcal{D}$, where $x = M$ or $y = M$, and consider what happens to the tangent plane's line of intersection
$\ell$ with the $z = 0$ plane as we move in along the path toward the origin $(0, 0)$. Initially, the tangent planes
lie entirely above the surface, and the intercept of $\ell$ is negative, $b < 0$. This intercept increases along the
path until $b = 0$, at which point $\ell$ intersects $(0, 0)$. Here, the tangent plane contains a line segment attached
on one end to the point of tangency, and at the other end to the point $(r, f(r)) = (0, 0, 0)$; everywhere
else, the tangent plane is above the surface. Continuing toward the origin, all other points along the path
have tangent planes such that $\ell$ has a positive intercept $b > 0$, hence these points are excluded from the
convex hull.

These considerations imply that the convex hull $c(r)$ is composed entirely of two kinds of points. First,
points which coincide with the original surface, $c(r) = \theta f(r)$, with $\theta = 1$. These points occur at values of
$r = (x, y)$ 'up and to the right' of $(0, 0)$. Second, points along line segments connecting surface points 'up
and to the right' $(r', f(r'))$ with the point $(r, f(r)) = (0, 0, 0)$, that is $c(r) = \theta f(r') + \bar{\theta} f(0, 0) = \theta f(r')$,
where $r = \theta r'$, $\bar{\theta} = 1 - \theta$, and $\theta \in [0, 1]$. Hence, for all $r \in \mathcal{D}$, $c(r)$ has the desired form.

An example of another function that behaves in the same way just described is $f(x, y) = (1 - (1 - x)^2)(1 -
(1 - y)^2)$, with $\mathcal{D} = [0, 1] \times [0, 1]$.
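For this example function the argument can be traced numerically. The sketch below is illustrative: the function names, the grid resolution, and the test points on the diagonal $y = x$ are assumptions, and the hull is approximated by a simple grid search over $\theta$ in $c(r) = \max \theta f(r/\theta)$. It also evaluates the intercept $b = (x' f_x + y' f_y - z')/f_y$ of the tangent plane's intersection line with the $z = 0$ plane.

```python
def f(x, y):
    """Example surface f(x, y) = (1 - (1 - x)^2)(1 - (1 - y)^2) on [0, 1]^2."""
    return (1 - (1 - x) ** 2) * (1 - (1 - y) ** 2)


def intercept_b(x, y):
    """Intercept b of the line where the tangent plane at (x, y) meets z = 0."""
    fx = 2 * (1 - x) * (1 - (1 - y) ** 2)  # partial derivative in x
    fy = (1 - (1 - x) ** 2) * 2 * (1 - y)  # partial derivative in y
    return (x * fx + y * fy - f(x, y)) / fy


def hull(x, y, steps=20000):
    """Grid-search c(r) = max over theta of theta * f(r / theta).

    Assumes 0 < x, y <= 1; theta is restricted so that r / theta stays
    inside [0, 1] x [0, 1].
    """
    lo = max(x, y)  # smallest admissible theta
    best = 0.0
    for k in range(steps + 1):
        theta = lo + (1 - lo) * k / steps
        best = max(best, theta * f(x / theta, y / theta))
    return best


for point in (0.3, 2 / 3, 0.8):
    print(point, intercept_b(point, point), f(point, point), hull(point, point))
```

On the diagonal, the intercept changes sign near $x = 2/3$: for larger $x$ the hull coincides with the surface ($\theta = 1$), while for smaller $x$ the hull lies strictly above the surface, on a segment through the origin.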



Appendix VIII 
Properties of Gaussian mutual information 

Our analysis of the Gaussian pattern recognition problem relies on well-known results, stated below without 
proof. 

Lemma 8.1: The mutual information between two Gaussian random vectors $X$ and $Y$ depends only on
the matrices of correlation coefficients. Specifically,

$$I(X; Y) = \frac{1}{2} \log(\det C_{x,x}) - \frac{1}{2} \log(\det C_{x,x|y}),$$

where $C_{x,x|y}$ denotes the conditional matrix $C_{x,x} - C_{x,y} C_{y,y}^{-1} C_{y,x}$.

The most well known special case, $Y = X + W$, where $X$ and $W$ are independent Gaussian random
variables with variances $P$ and $N$, respectively, yields

$$I(X; Y) = \frac{1}{2} \log\left(1 + \frac{P}{N}\right) = -\frac{1}{2} \log(1 - \rho_{x,y}^2),$$

where the correlation coefficient $\rho_{x,y} = \sqrt{P/(P + N)}$.
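The agreement between the two forms of the scalar formula is easy to confirm numerically. This is a minimal sketch: the function names and the values of $P$ and $N$ are hypothetical, and natural logarithms (nats) are assumed since the base is not fixed here.

```python
import math


def awgn_mutual_information(P, N):
    """I(X; Y) for Y = X + W with signal variance P and noise variance N (nats)."""
    return 0.5 * math.log(1 + P / N)


def mi_from_correlation(rho):
    """The same quantity written in terms of the correlation coefficient."""
    return -0.5 * math.log(1 - rho ** 2)


P, N = 3.0, 1.0  # hypothetical variances
rho = math.sqrt(P / (P + N))
print(awgn_mutual_information(P, N), mi_from_correlation(rho))
```

Both expressions agree because $1 - \rho_{x,y}^2 = N/(P + N)$.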

Lemma 8.2: If $X$, $Y$ and $Z$ are zero mean Gaussian random vectors that form a Markov chain $X - Y - Z$,
then

$$C_{x,z} = C_{x,y} C_{y,y}^{-1} C_{y,z}.$$

Note that for dimension one, $X \to Y \to Z$ implies $\rho_{x,z} = \rho_{x,y} \rho_{y,z}$.

Lemma 8.3: Let $X$, $Y$, $U$, and $V$ be jointly Gaussian random variables such that $U - X - Y$ and $X - Y - V$
are Markov chains. Then the matrix of correlation coefficients $C_{xy,uv}$ decomposes as

$$C_{xy,uv} = \begin{bmatrix} 1 & \rho_{xy} \\ \rho_{xy} & 1 \end{bmatrix} \begin{bmatrix} \rho_{xu} & 0 \\ 0 & \rho_{yv} \end{bmatrix}.$$

This lemma follows immediately by using Lemma 8.2 to obtain the substitutions $C_{x,v} = C_{x,y} C_{y,y}^{-1} C_{y,v} =
\rho_{xy} \rho_{yv}$ and $C_{u,y} = C_{u,x} C_{x,x}^{-1} C_{x,y} = \rho_{ux} \rho_{xy}$.
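The decomposition can be checked on a concrete scalar model. The sketch below is illustrative only: the linear construction $Y = aX + W_y$, $U = bX + W_u$, $V = cY + W_v$ with independent unit-variance sources and the gain values are assumptions chosen so that both Markov chains hold by construction.

```python
import math

a, b, c = 0.8, 1.5, 0.6  # hypothetical gains

# Each variable as a linear combination of independent unit-variance
# sources (X, W_y, W_u, W_v):  Y = aX + W_y,  U = bX + W_u,  V = cY + W_v.
loadings = {
    "x": (1.0, 0.0, 0.0, 0.0),
    "y": (a, 1.0, 0.0, 0.0),
    "u": (b, 0.0, 1.0, 0.0),
    "v": (c * a, c, 0.0, 1.0),
}


def cov(s, t):
    """Covariance of two variables from their loadings on the sources."""
    return sum(p * q for p, q in zip(loadings[s], loadings[t]))


def rho(s, t):
    """Correlation coefficient between variables s and t."""
    return cov(s, t) / math.sqrt(cov(s, s) * cov(t, t))


# Cross-correlation matrix between (X, Y) and (U, V), computed directly.
direct = [[rho("x", "u"), rho("x", "v")],
          [rho("y", "u"), rho("y", "v")]]

# Product [[1, r_xy], [r_xy, 1]] times diag(r_xu, r_yv), written out.
r_xy, r_xu, r_yv = rho("x", "y"), rho("x", "u"), rho("y", "v")
factored = [[r_xu, r_xy * r_yv],
            [r_xy * r_xu, r_yv]]

max_error = max(abs(direct[i][j] - factored[i][j])
                for i in range(2) for j in range(2))
print(max_error)
```

The off-diagonal entries are exactly the one-dimensional products $\rho_{xv} = \rho_{xy} \rho_{yv}$ and $\rho_{yu} = \rho_{xy} \rho_{xu}$.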